User:AKhatun/Wikidata Basic Analysis
The following analysis was done on Wikidata to understand more about Wikidata itself: which subjects, predicates and objects occur most often, what types of triples it contains, how much of it refers to wiki or non-wiki objects, and so on. The analysis was done on the Wikidata snapshot 20210614 using Python in a Jupyter Notebook. Other packages used are Spark, RDFLib, SPARQLWrapper, and Pandas. Some parts of the analysis collected data with SPARQL from the WDQS endpoint, which serves the latest data (for example, to easily get labels and datatypes of literals). Small differences between the snapshot and the latest data are therefore sometimes found.
The Wikidata prefixes list can be found here: Full_list_of_prefixes
Phabricator Ticket: T282139
Jupyter Notebook: Wikidata Analysis Notebook
Overview
As of 20210614:
- Total number of triples: 12,910,066,145 (12.9B)
- Total number of distinct items (context): 97,315,151
- Total number of distinct predicates: 41,117
- Total number of triples related to references: 379,164,793 (379M, 2.9%)
- Total number of references: 90,062,598 (90M)
- Total triples with value node as subject: 279,313,267 (279M, 2.2%)
- Total distinct value nodes: 61,518,273 (61M)
- Number of triples with Wiki objects: 6,828,025,880 (6.8B, 52.9%)
- Number of triples with Non-Wiki objects: 5,843,911,006 (5.8B, 45.3%)
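The percentages in the list above are shares of the total triple count. A minimal sketch of that arithmetic, using only the counts quoted in the list (the dictionary keys and the `share` helper are names chosen here for illustration, not taken from the notebook):

```python
# Counts copied from the overview list above.
TOTAL_TRIPLES = 12_910_066_145

counts = {
    "reference triples": 379_164_793,
    "value-node triples": 279_313_267,
    "wiki-object triples": 6_828_025_880,
    "non-wiki-object triples": 5_843_911_006,
}


def share(count, total=TOTAL_TRIPLES):
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)


shares = {name: share(n) for name, n in counts.items()}
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```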
Items
How many different things does Wikidata talk about? This is a very high-level question, answered here based on the context of each triple in Wikidata. For example, all triples under the Q42 context can be found here: Q42 dump. The top 20 items are shown in the table below.
Item | Item Label | Count |
---|---|---|
wd:Q39790431 | BayGenomics: a resource of insertional mutations in mouse embryonic stem cells | 41847 |
wd:Q57661806 | Erratum to: Search for supersymmetry in events containing a same-flavour opposite-sign dilepton pair, jets, and large missing transverse momentum in √s = 8 TeV pp collisions with the ATLAS detector | 34517 |
wd:Q56836084 | 40 EASD Annual Meeting of the European Association for the Study of Diabetes : Munich, Germany, 5-9 September 2004 | 33299 |
wd:Q64022985 | Combinations of single-top-quark production cross-section measurements and fLVVtb determinations at √s = 7 and 8 TeV with the ATLAS and CMS experiments | 32078 |
wd:Q21558717 | Combined Measurement of the Higgs Boson Mass in p p Collisions at s = 7 and 8 TeV with the ATLAS and CMS Experiments | 31791 |
wd:Q56754739 | Measurements of the Higgs boson production and decay rates and constraints on its couplings from a combined ATLAS and CMS analysis of the LHC pp collision data at √s = 7 and 8 TeV | 31653 |
wd:Q56895655 | Combination of inclusive and differential t t̄ charge asymmetry measurements using ATLAS and CMS data at √s = 7 and 8 TeV | 31562 |
wd:Q57920219 | 35th Annual Meeting of the European Association for the Study of Diabetes | 27656 |
wd:Q56883844 | 35th Annual Meeting of the European Association for the Study of Diabetes : Brussels, Belgium, 28 September-2 October 1999 | 27632 |
wd:Q57735077 | ABSTRACTS | 27267 |
wd:Q56489295 | Search for supersymmetry in events containing a same-flavour opposite-sign dilepton pair, jets, and large missing transverse momentum in [Formula: see text] TeV collisions with the ATLAS detector | 24994 |
wd:Q93740619 | XXIV World Allergy Congress 2015: Seoul, Korea. 14-17 October 2015 | 21491 |
wd:Q21521425 | Charged-particle multiplicities in pp interactions at root s=900 GeV measured with the ATLAS detector at the LHC ATLAS Collaboration | 19904 |
wd:Q56289397 | Performance of the ATLAS detector using first collision data | 19722 |
wd:Q57018684 | Measurement of the W → ℓν and Z/γ * → ℓℓ production cross sections in proton-proton collisions at √s = 7 TeV with the ATLAS detector | 19692 |
wd:Q57018057 | Measurement of inclusive jet and dijet cross sections in proton-proton collisions at 7 TeV centre-of-mass energy with the ATLAS detector | 19689 |
wd:Q56501626 | Search for new particles in two-jet final states in 7 TeV proton-proton collisions with the ATLAS detector at the LHC | 19640 |
wd:Q21521423 | Search for quark contact interactions in dijet angular distributions in pp collisions at root s=7 TeV measured with the ATLAS detector | 19635 |
wd:Q57016199 | Search for heavy vector-like quarks coupling to light quarks in proton–proton collisions at s = 7 TeV with the ATLAS detector | 19276 |
wd:Q57661921 | Erratum to: “Search for first generation scalar leptoquarks in pp collisions at s = 7 TeV with the ATLAS detector” [Phys. Lett. B 709 (2012) 158] | 19231 |
- Total number of distinct items (context): 97315151 (0.75% of total triples)
- The top 50 items are the 50 items with the most related triples (the table above shows the first 20)
- All of the top 50 seem to be scholarly articles
- These have *lots* of authors, plus further information about the authors as statements
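The item counts above group triples by context, whereas the Top Subjects section groups them by subject. A minimal pure-Python sketch of the two groupings follows. The original analysis used Spark; the rows, the statement ID `wds:Q42-abc`, and the function names here are made up for illustration, and the sketch assumes that statement-node triples share the context of their item.

```python
from collections import Counter

# Toy rows of (context, subject, predicate, object). The values are
# illustrative only; "wds:Q42-abc" is a hypothetical statement node.
quads = [
    ("wd:Q42", "wd:Q42", "wdt:P31", "wd:Q5"),
    ("wd:Q42", "wd:Q42", "p:P31", "wds:Q42-abc"),
    ("wd:Q42", "wds:Q42-abc", "ps:P31", "wd:Q5"),
    ("wd:Q1", "wd:Q1", "wdt:P31", "wd:Q36906466"),
]


def top_items(rows, n=20):
    """Count triples per context and return the n largest."""
    per_context = Counter(context for context, _s, _p, _o in rows)
    return per_context.most_common(n)


def top_subjects(rows, n=20):
    """Count triples per subject and return the n largest."""
    per_subject = Counter(subject for _c, subject, _p, _o in rows)
    return per_subject.most_common(n)


print(top_items(quads))
print(top_subjects(quads, n=1))
```

Under this assumption an item's context count also includes triples whose subject is one of its statement nodes, which would be consistent with the item counts being larger than the subject counts for the same entity.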
Top Subjects
Once again it seems top subjects are scholarly articles.
Subject | Subject Label | Count |
---|---|---|
wd:Q39790431 | BayGenomics: a resource of insertional mutations in mouse embryonic stem cells | 16758 |
wd:Q57661806 | Erratum to: Search for supersymmetry in events containing a same-flavour opposite-sign dilepton... | 11371 |
wd:Q56836084 | 40 EASD Annual Meeting of the European Association for the Study of Diabetes : Munich, Germany,... | 11054 |
wd:Q64022985 | Combinations of single-top-quark production cross-section measurements and fLVVtb determinati... | 10460 |
wd:Q21558717 | Combined Measurement of the Higgs Boson Mass in p p Collisions at s = 7 and 8 TeV wi... | 10351 |
wd:Q106988069 | Combined Measurement of the Higgs Boson Mass in pp Collisions at √s=7 and 8 TeV with the A... | 10338 |
wd:Q56754739 | Measurements of the Higgs boson production and decay rates and constraints on its couplings fro... | 10285 |
wd:Q56895655 | Combination of inclusive and differential t t̄ c... | 10254 |
wd:Q58231267 | Erratum to: 36th International Symposium on Intensive Care and Emergency Medicine | 9861 |
wd:Q57920219 | 35th Annual Meeting of the European Association for the Study of Diabetes | 9191 |
wd:Q56883844 | 35th Annual Meeting of the European Association for the Study of Diabetes : Brussels, Belgium, ... | 9187 |
wd:Q57735077 | ABSTRACTS | 9092 |
wd:Q56489295 | Search for supersymmetry in events containing a same-flavour opposite-sign dilepton pair, jets,... | 8145 |
wd:Q21521425 | Charged-particle multiplicities in pp interactions at root s=900 GeV measured with the ATLAS de... | 6543 |
wd:Q21521423 | Search for quark contact interactions in dijet angular distributions in pp collisions at root s... | 6454 |
wd:Q57018684 | Measurement of the W → ℓν and Z/γ * → ℓℓ production cross se... | 6446 |
wd:Q56289397 | Performance of the ATLAS detector using first collision data | 6435 |
wd:Q57018057 | Measurement of inclusive jet and dijet cross sections in proton-proton collisions at 7 TeV cent... | 6426 |
wd:Q56501626 | Search for new particles in two-jet final states in 7 TeV proton-proton collisions with the ATL... | 6421 |
wd:Q27972199 | ESG: extended similarity group method for automated protein function prediction | 6292 |
References
Top references, after removing duplicates. Top references as subjects are the ones with the most triples associated with them (i.e. the reference is the subject). Top references as objects are the ones that are used the most; usage is counted by treating the reference as an object (the usage count in the tables). The references with the most triples are not necessarily the most used ones. The first table below lists the top references by triple count, the second by usage count.
Analysis Note: References and values have duplicates in HDFS due to the dumping process. In the real triple store, they are deduplicated. So from here on, for references and values, only distinct triples are considered.
- Total number of triples related to references (count of triples in reference context): 379164793 (379M)
- Total number of references: 90062598 (90M)
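The two measures used in this section (triple count with the reference as subject, usage count with the reference as object) and the deduplication step can be sketched in plain Python. This is a toy illustration, not the notebook's Spark code: the triples and statement IDs are invented, and it assumes references are attached to statements through `prov:wasDerivedFrom`, as in the Wikidata RDF format.

```python
from collections import Counter

# Invented triples; the second "ref:aaa" row duplicates the first on purpose.
triples = [
    ("wds:S1", "prov:wasDerivedFrom", "ref:aaa"),
    ("wds:S2", "prov:wasDerivedFrom", "ref:aaa"),
    ("ref:aaa", "pr:P248", "wd:Q1"),
    ("ref:aaa", "pr:P248", "wd:Q1"),
    ("wds:S3", "prov:wasDerivedFrom", "ref:bbb"),
    ("ref:bbb", "pr:P854", "<http://example.org/a>"),
    ("ref:bbb", "pr:P813", '"2021-01-01T00:00:00Z"'),
]


def reference_stats(rows):
    """Return (triple_count, usage_count) counters over distinct triples."""
    distinct = set(rows)  # drop duplicates introduced by the dump
    triple_count = Counter(s for s, _p, _o in distinct if s.startswith("ref:"))
    usage_count = Counter(
        o for _s, p, o in distinct if p == "prov:wasDerivedFrom"
    )
    return triple_count, usage_count


triple_count, usage_count = reference_stats(triples)
print(triple_count.most_common())
print(usage_count.most_common())
```

In the toy data `ref:bbb` has more triples but `ref:aaa` is used more often, mirroring the observation that the references with the most triples are not necessarily the most used.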
Reference | triple count | usage count |
---|---|---|
ref:07b55fdfe5fba3fda539dbefb7196da0e3460d2b | 102 | 1 |
ref:68e48339e339a3bda7932cac38f44abe27de1461 | 35 | 1 |
ref:703f0d28768bd798064c25fcdce64ea5dfbd6c5a | 33 | 1 |
ref:35ff7a307543d079cc224bb7aa75ef02a164049f | 28 | 1 |
ref:d2658c2ffc4a87017867dffe00c3cccc64f6a131 | 27 | 1 |
ref:c892725170c4c673767355f2581286c675613844 | 26 | 1 |
ref:c7105386906164ed1a2e4ef334b43e9f00c00157 | 26 | 1 |
ref:5cb21fcb42c03830f7125eaa545e577361c2f9ef | 25 | 1 |
ref:7e1244220f770f53ec309f3dce0845f990959d7d | 25 | 1 |
ref:872a839ab7777797a4a498442811816c70025da5 | 24 | 1 |
ref:55ee45a8d9f9cc0fad2cae61f5e42aced44261e0 | 24 | 1 |
ref:426796f41cc0666ac881b1f42501cbdb0064e976 | 24 | 1 |
ref:ff4ad8769bd82d915b6c6e5f2004f13b57efc5ff | 22 | 1 |
ref:a0ea572733723ae44d5d3c10cea8a79e9e67e7da | 21 | 1 |
ref:b9ca90f1e1de79de773a3a7f3f6f014ade3ca397 | 21 | 1 |
ref:7c4655b9fadcc3751795f4fc610854826e2095a1 | 20 | 1 |
ref:8e260ab6e7cd618239354955d7c86558ea9992aa | 20 | 1 |
ref:dfeadb7be3fd743c77af182ad62cf834c89587bd | 20 | 1 |
ref:ce66538f0ea508e1ad69004f962ba53c5b7ed05a | 20 | 1 |
ref:d8488d862542e7169f3c77b836caa1274c959e8c | 20 | 1 |
Reference | triple count | usage count |
---|---|---|
ref:8ba559d5760a03bedaaacc3c347bbfe4981560bf | 1 | 46222198 |
ref:b64af6c056b6c5f6a7ea17156dcd718d4744bbf8 | 1 | 32783765 |
ref:fa278ebfc458360e5aed63d5058cca83c46134f1 | 1 | 14391465 |
ref:6b647975ae22e206a4cd711623ecb06abadbdb9e | 1 | 10767806 |
ref:0723282bb80042897ca697416c050b4bf7fb5428 | 1 | 6246037 |
ref:9a24f7c0208b05d6be97077d855671d1dfdbc0dd | 1 | 5183641 |
ref:7c4765d26b6b678783fec763a62a05f82ef36291 | 1 | 4663919 |
ref:64141ed6d84b2cf105b1656d0c0f094358a3dd4f | 1 | 4141724 |
ref:43a0088c51fd85e5a85d1b46412c3a635e6d4edc | 1 | 3756463 |
ref:288ab581e7d2d02995a26dfa8b091d96e78457fc | 1 | 3047132 |
ref:6c44b0eb3905101f3d17982ef3fddb8cb2b3e278 | 1 | 2972781 |
ref:0ee3b3ba1c958f4c3dcba7ed8091fe4b57311348 | 1 | 2637075 |
ref:d5847b9b6032aa8b13dae3c2dfd9ed5d114d21b3 | 1 | 2595349 |
ref:3913844e06e055e8cd81608f22bad0e604d89d2d | 1 | 2547282 |
ref:bd49d3e4f67bc460ce7a06b6ac3027347cf5ee55 | 1 | 2397088 |
ref:d4bd87b862b12d99d26e86472d44f26858dee639 | 1 | 2330033 |
ref:efa0005ffbf7ddad87bc72240c9732b6a01f9f0e | 1 | 1997758 |
ref:eec9dbd6f74260dc8f8c2ee1b0ecd8c64d973be5 | 1 | 1728596 |
ref:377e4d758ca3aff7d42243bbd9df04682e6b611b | 1 | 1651288 |
ref:a29a646602abf65105ed0f39a44231c962ece9ee | 1 | 1463936 |
Let us explore some.
Reference #1
This section explores the reference with the most triples where it is the subject: ref:07b55fdfe5fba3fda539dbefb7196da0e3460d2b, with 102 triples. Some of the triples are shown below.
predicate | object |
---|---|
pr:P3452 | wd:Q41555988 |
pr:P3452 | wd:Q42605633 |
pr:P3452 | wd:Q42614357 |
pr:P3452 | wd:Q42612177 |
pr:P3452 | wd:Q42615213 |
pr:P3452 | wd:Q42615740 |
pr:P3452 | wd:Q42613597 |
This ref seems to be used only in one place.
subject | subject label | predicate | object |
---|---|---|---|
wd:Q36502461 | Allgemeiner Harz-Berg-Kalender | publisher | ref:07b55fdfe5fba3fda539dbefb7196da0e3460d2b |
Reference #2
This is the reference with the second most triples where it is the subject: ref:68e48339e339a3bda7932cac38f44abe27de1461, with 35 triples. Some of the triples are shown below.
predicate | object |
---|---|
pr:P854 | <https://hollisarchives.lib.harvard.edu/repositories/27/archival_objects/1368433> |
pr:P248 | wd:Q106715485 |
pr:P854 | <https://hollisarchives.lib.harvard.edu/repositories/27/archival_objects/1368440> |
prv:P813 | wdv:e06efec16adfbaad0a72e3b6d9fc28fe |
pr:P854 | <https://hollisarchives.lib.harvard.edu/repositories/27/archival_objects/1368443> |
pr:P813 | "2021-05-22T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> |
This ref also seems to be used only in one place.
subject | subject label | predicate | object |
---|---|---|---|
wd:Q3782554 | Lyres | has works in the collection | ref:68e48339e339a3bda7932cac38f44abe27de1461 |
Reference #3
This is a reference for KBpedia ID on a specific date. Example of where it is used: the KBpedia statement in Q2013. Some more places where this reference is used are given below (see more using this SPARQL query).
subject | subject label | predicate | object |
---|---|---|---|
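Q125 | November | KBpedia ID | ref:9a681f9dd95c90224547c404e11295f4f7dcf54e |
Q140 | lion | KBpedia ID | ref:9a681f9dd95c90224547c404e11295f4f7dcf54e |
Q144 | dog | KBpedia ID | ref:9a681f9dd95c90224547c404e11295f4f7dcf54e |
Q147 | kitten | KBpedia ID | ref:9a681f9dd95c90224547c404e11295f4f7dcf54e |
Q148 | People's Republic of China | KBpedia ID | ref:9a681f9dd95c90224547c404e11295f4f7dcf54e |
Q155 | Brazil | KBpedia ID | ref:9a681f9dd95c90224547c404e11295f4f7dcf54e |
Q177 | pizza | KBpedia ID | ref:9a681f9dd95c90224547c404e11295f4f7dcf54e |
Q178 | pasta | KBpedia ID | ref:9a681f9dd95c90224547c404e11295f4f7dcf54e |
Q2013 | Wikidata | KBpedia ID | ref:9a681f9dd95c90224547c404e11295f4f7dcf54e |
Q23 | George Washington | KBpedia ID | ref:9a681f9dd95c90224547c404e11295f4f7dcf54e |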
This ref has 3 triples where it is the subject. The triples of this reference are shown below.
predicate | object |
---|---|
prv:P813 | <http://www.wikidata.org/value/664bae4effccc18fd4ad1ae188fab025> |
pr:P813 | "2020-07-09T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> |
pr:P248 | <http://www.wikidata.org/entity/Q64139102> |
Reference #4
This is a reference for 'taxon common name' on a specific date. Some places where it is used are given below. Notice that the same item can use this reference multiple times. The table includes a couple of different items that use this reference.
subject | subject label | predicate | object |
---|---|---|---|
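Q17970 | Jabiru mycteria | taxon common name | ref:9a681dbf31ebd5fd1d2006e0c492516e6c3d59d7 |
Q17970 | Jabiru mycteria | taxon common name | ref:9a681dbf31ebd5fd1d2006e0c492516e6c3d59d7 |
Q18836 | Common Buttonquail | taxon common name | ref:9a681dbf31ebd5fd1d2006e0c492516e6c3d59d7 |
Q18836 | Common Buttonquail | taxon common name | ref:9a681dbf31ebd5fd1d2006e0c492516e6c3d59d7 |
Q26490 | Common Kestrel | taxon common name | ref:9a681dbf31ebd5fd1d2006e0c492516e6c3d59d7 |
Q26620 | Common Redstart | taxon common name | ref:9a681dbf31ebd5fd1d2006e0c492516e6c3d59d7 |
Q26657 | Goldcrest | taxon common name | ref:9a681dbf31ebd5fd1d2006e0c492516e6c3d59d7 |
Q26685 | Atlantic Puffin | taxon common name | ref:9a681dbf31ebd5fd1d2006e0c492516e6c3d59d7 |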
This ref has 3 triples where it is the subject. The triples of this reference are shown below.
predicate | object |
---|---|
prv:P813 | <http://www.wikidata.org/value/055167878d6ea2b50690069f330bb773> |
pr:P813 | "2016-10-16T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> |
pr:P248 | <http://www.wikidata.org/entity/Q27042747> |
Values
Values are nodes that hold values such as times or quantities, along with precision, time zone, etc. References, for example, have the direct values plus value nodes that hold more information about those values. For more, see Value_representation.
Analysis Note: References and values have duplicates in HDFS due to the dumping process. In the real triple store, they are deduplicated. For references and values, only distinct triples are considered.
- Total triples with value node as subject: 279313267 (279M)
- Total distinct value nodes: 61518273 (61M)
Type | count |
---|---|
<http://wikiba.se/ontology#QuantityValue> | 52564412 |
<http://wikiba.se/ontology#GlobecoordinateValue> | 8603170 |
<http://wikiba.se/ontology#TimeValue> | 350691 |
<http://wikiba.se/ontology#GeoAutoPrecision> | 93780 |
value | triple count | usage count |
---|---|---|
v:e2bd8f07b10701c92eacf58a0329b127 | 6 | 1 |
v:0bee50ecdf15d0e640ac9d69a68cdc76 | 6 | 1 |
v:c23b7a0348b297277984fde34e6a51ab | 6 | 3 |
v:596fec2ac8604b8607ecb4b5b83f468c | 6 | 3 |
v:383f9a273c8ea452d57902f936263369 | 6 | 1 |
v:c1a3dbbac3f8b4a37ffb91daa2e86317 | 6 | 24 |
v:1ab662093fe7836e96f3a5a780247c4a | 6 | 1 |
v:2c03608955c6e8eeba4e2e55805c4979 | 6 | 2 |
v:211d8f31b9a28f79dd20477e31e1c5ae | 6 | 1 |
v:f86ef017ad15adc68b55ada2df7f248a | 6 | 10 |
v:fb485962a17c8c5be3ad4c894a281c65 | 6 | 1 |
v:80356c41f851dcbbdb594e43ac82369d | 6 | 3 |
v:f2b33b065b12f1668cf13b3562cf19db | 6 | 1 |
v:bfc2ab241dafc94425fba4a642e6009d | 6 | 2 |
v:fec78667c7a021215ca59728258be16c | 6 | 1 |
v:40ea08aa3e1d6edc307d961fc9dc0b69 | 6 | 23 |
v:357e66846e15f89837bdc31792df2e6a | 6 | 4 |
v:748476f9b6f3daf4d9a6818798263d6c | 6 | 1 |
v:65113fbe866cec6af2d05e3aaa2bac0e | 6 | 2 |
v:e395a45e7e42cd416e2269fbdd1ab8f9 | 6 | 2 |
value | triple count | usage count |
---|---|---|
v:c610c7d0abbfe361e367744369f5d33d | 6 | 10121290 |
v:4e601d1880d647664093f1b20b24dacf | 5 | 4542586 |
v:d0931b31b1c31ffa1325777f65b723db | 5 | 3277169 |
v:7e281616976c7de150357c18e76abfd1 | 5 | 1452465 |
v:d40cb13acf8001d779efbb0c45cb42f0 | 5 | 1412411 |
v:8c2bdc5006a93f73a2b03849218b4e7b | 5 | 1217459 |
v:b3795d3425e0bbdd474f3138cad4a069 | 5 | 940261 |
v:202de63fcb2e0943a9b5d0cebf189569 | 5 | 685430 |
v:a843a14d6be3111e93a253fd623f18cf | 5 | 580686 |
v:dafb9cf711b15afe91ec0aa7158e57a6 | 5 | 573003 |
v:1c9e02c1631d3fd5ec9a9fe9aa1fde65 | 3 | 562672 |
v:67a17c05603b9e75b6c25826fa747705 | 3 | 486942 |
v:5a2515dd8960847405b294e0a7999403 | 5 | 466608 |
v:5a166c540a59253c92144230db78cb8a | 5 | 428426 |
v:8c0d739994215f213311d29254302049 | 5 | 422030 |
v:39c2e70b9990c3ed7be32f8e34015853 | 5 | 420452 |
v:b441aa14f32ad7a9e6fe04eb80002b4c | 5 | 411528 |
v:216f4f19c804fc50c737da9ae87494a9 | 5 | 387676 |
v:1ce0c285c67a65d1e2e620ace3f6c897 | 5 | 369781 |
v:ddd311e198ef615dbfaaa3f42aeec7b4 | 5 | 364341 |
I explore a few values below.
Value #1
Value node v:e2bd8f07b10701c92eacf58a0329b127 is of type QuantityValue. The places where it is used are shown below. The usage table shows what kind of value it is used as (i.e. the predicate of which the value is an object) and how many times it is used with that predicate.
predicate | predicate label | count |
---|---|---|
P2216 | radial velocity | 1 |
Value #2
Value node v:e2ce06d70b150e202e4c81681e5334e5 is of type TimeValue.
predicate | predicate label | count |
---|---|---|
P577 | publication date | 35774 |
P580 | start time | 800 |
P582 | end time | 233 |
P570 | date of death | 43 |
P571 | inception | 36 |
P576 | dissolved, abolished or demolished date | 27 |
P7588 | effective date | 21 |
P585 | point in time | 16 |
P1619 | date of official opening | 5 |
P569 | date of birth | 2 |
P1191 | date of first performance | 1 |
P2960 | archive date | 1 |
P575 | time of discovery or invention | 1 |
P620 | time of spacecraft landing | 1 |
P729 | service entry | 1 |
Value #3
Value node v:e2ce017b6638b9684082390db9ce311f is of type TimeValue.
predicate | predicate label | count |
---|---|---|
P813 | retrieved | 44722 |
P577 | publication date | 1048 |
P5017 | last update | 195 |
P585 | point in time | 64 |
P570 | date of death | 59 |
P580 | start time | 57 |
P582 | end time | 56 |
P2960 | archive date | 9 |
P6949 | announcement date | 2 |
P1319 | earliest date | 1 |
P2031 | work period (start) | 1 |
P571 | inception | 1 |
Top Predicates
Number of distinct predicates: 41117
predicate | count |
---|---|
schema:description | 2462371590 |
rdf:type | 1406804171 |
wikibase:rank | 1288379756 |
prov:wasDerivedFrom | 1003288418 |
rdfs:label | 495214648 |
p:P2860 | 247929898 |
ps:P2860 | 247929861 |
wdt:P2860 | 247928515 |
pq:P1545 | 157196523 |
p:P2093 | 135492184 |
ps:P2093 | 135492121 |
wdt:P2093 | 135397718 |
skos:altLabel | 102185811 |
p:P31 | 99091730 |
ps:P31 | 99091721 |
schema:dateModified | 97315730 |
schema:version | 97315148 |
wdt:P31 | 95759651 |
wikibase:statements | 93970378 |
wikibase:sitelinks | 93464760 |
wikibase:identifiers | 93464760 |
All Wikidata-related predicates are assumed to be prefixed with wikidata.org or wikiba.se. Anything with a different prefix is therefore considered a non-wiki predicate.
Both wiki and non-wiki predicates can have wiki or non-wiki objects.
- wiki predicate, non-wiki object: wd:Q30 wdt:P2250 "78.69024"
- non-wiki predicate, wiki object: data:P31 schema:about wd:P31
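The wiki / non-wiki split described above can be sketched as a small classifier over full URIs. This is an illustration under stated assumptions rather than the notebook's code: the function names are chosen here, and matching is done on the URI host (so `www.wikidata.org` counts as `wikidata.org`), which is one plausible reading of "prefixed with".

```python
from urllib.parse import urlparse

WIKI_DOMAINS = ("wikidata.org", "wikiba.se")


def is_wiki(term):
    """True if term is an <...> URI hosted on one of the wiki domains."""
    if not (term.startswith("<") and term.endswith(">")):
        return False  # literals are never wiki terms
    host = urlparse(term[1:-1]).hostname or ""
    return any(host == d or host.endswith("." + d) for d in WIKI_DOMAINS)


def classify(triple):
    """Describe the predicate and object of a (subject, predicate, object) triple."""
    _s, p, o = triple

    def kind(t):
        return "wiki" if is_wiki(t) else "non-wiki"

    return f"{kind(p)} predicate, {kind(o)} object"


# The two examples from the list above, written with expanded prefixes.
example_1 = (
    "<http://www.wikidata.org/entity/Q30>",
    "<http://www.wikidata.org/prop/direct/P2250>",
    '"78.69024"',
)
example_2 = (
    "<https://www.wikidata.org/wiki/Special:EntityData/P31>",
    "<http://schema.org/about>",
    "<http://www.wikidata.org/entity/P31>",
)
print(classify(example_1))
print(classify(example_2))
```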
predicate | count |
---|---|
schema:description | 2462371590 |
rdf:type | 1406804171 |
prov:wasDerivedFrom | 1003288418 |
rdfs:label | 495214648 |
skos:altLabel | 102185811 |
schema:dateModified | 97315730 |
schema:version | 97315148 |
schema:about | 78517643 |
schema:inLanguage | 78516557 |
schema:isPartOf | 78516557 |
schema:name | 78516557 |
ontolex:representation | 9957440 |
ontolex:lexicalForm | 8730481 |
owl:sameAs | 3344770 |
dct:language | 496678 |
skos:definition | 142252 |
ontolex:sense | 128287 |
owl:onProperty | 8940 |
owl:complementOf | 8940 |
owl:someValuesFrom | 8940 |
owl:imports | 1 |
cc:license | 1 |
schema:softwareVersion | 1 |
predicate | predicate label | count |
---|---|---|
wikibase:rank | | 1288379756 |
p:P2860 | cites work | 247929898 |
ps:P2860 | cites work | 247929861 |
wdt:P2860 | cites work | 247928515 |
pq:P1545 | series ordinal | 157196523 |
p:P2093 | author name string | 135492184 |
ps:P2093 | author name string | 135492121 |
wdt:P2093 | author name string | 135397718 |
p:P31 | instance of | 99091730 |
ps:P31 | instance of | 99091721 |
wdt:P31 | instance of | 95759651 |
wikibase:statements | | 93970378 |
wikibase:sitelinks | | 93464760 |
wikibase:identifiers | | 93464760 |
pr:P248 | stated in | 76940160 |
pr:P813 | retrieved | 74812531 |
prv:P813 | retrieved | 74812525 |
pr:P854 | reference URL | 56837807 |
wikibase:quantityAmount | | 52564412 |
wikibase:quantityUnit | | 52564412 |
p:P1476 | title | 40923677 |
ps:P1476 | title | 40921787 |
wdt:P1476 | title | 40916525 |
p:P577 | publication date | 39645641 |
ps:P577 | publication date | 39645462 |
psv:P577 | publication date | 39645061 |
wdt:P577 | publication date | 39633371 |
p:P1433 | published in | 37349691 |
ps:P1433 | published in | 37349680 |
wdt:P1433 | published in | 37348812 |
wikibase:quantityNormalized | | 35418096 |
p:P304 | page(s) | 34875085 |
ps:P304 | page(s) | 34875081 |
wdt:P304 | page(s) | 34875052 |
p:P478 | volume | 34665675 |
ps:P478 | volume | 34665668 |
wdt:P478 | volume | 34665659 |
psv:P1215 | apparent magnitude | 33123781 |
ps:P1215 | apparent magnitude | 33123781 |
p:P1215 | apparent magnitude | 33123781 |
pq:P1227 | astronomical filter | 33123753 |
p:P698 | PubMed ID | 31983819 |
ps:P698 | PubMed ID | 31983819 |
wdt:P698 | PubMed ID | 31948158 |
p:P433 | issue | 31751756 |
ps:P433 | issue | 31751753 |
wdt:P433 | issue | 31751748 |
p:P528 | catalog code | 28701730 |
ps:P528 | catalog code | 28701676 |
wdt:P528 | catalog code | 28698943 |
Object
Objects have too many distinct values, especially because of literals (strings, numerals, times, dates, etc.), so a list of top objects would not serve any useful purpose. Instead, I looked at the types of objects in Wikidata.
Broadly speaking, objects can be:
- Literals: Literals can have datatypes
- URIs: May or may not have <type> predicate specified.
- Wiki URI
- Non-wiki URI
Of the objects that are URIs and have a <type> predicate in Wikidata, let us find the top types of object. Note that this is not the count of object usage or occurrence, but just the number of distinct objects with that <type>. More on the distribution of the number of triples with each kind of object can be found in the Wiki vs Non-wiki Triples section.
Type of object URI | count |
---|---|
wikibase:BestRank | 1256923711 |
wikibase:QuantityValue | 52564412 |
ontolex:Form | 8730481 |
wikibase:GlobecoordinateValue | 8603170 |
wikibase:TimeValue | 350691 |
ontolex:LexicalSense | 128287 |
wikibase:GeoAutoPrecision | 93780 |
owl:ObjectProperty | 68987 |
wdno:P364 | 60594 |
wdno:P17 | 39925 |
schema:Article | 33268 |
wdno:P155 | 32010 |
owl:DatatypeProperty | 29805 |
ontolex:LexicalEntry | 18917 |
owl:Class | 8940 |
owl:Restriction | 8940 |
wdno:P814 | 8713 |
wikibase:Property | 8605 |
wdno:P156 | 8457 |
wdno:P162 | 8191 |
Of the objects that are literals, we can find the datatype of the literals. The table below counts the number of objects with a specific datatype.
dtype | count | percentage |
---|---|---|
<http://www.w3.org/2001/XMLSchema#integer> | 397670762 | 36.22 |
<http://www.w3.org/2001/XMLSchema#decimal> | 344236061 | 31.35 |
<http://www.w3.org/2001/XMLSchema#dateTime> | 312169343 | 28.43 |
<http://www.w3.org/2001/XMLSchema#double> | 26333634 | 2.39 |
<http://www.opengis.net/ont/geosparql#wktLiteral> | 17557277 | 1.60 |
<http://www.w3.org/1998/Math/MathML> | 45058 | 0.004 |
Wiki vs Non-wiki Triples
Triples can have objects that are later expanded within Wikidata. These objects have to be wiki objects. Objects that start with the prefix wikidata.org or wikiba.se are considered wiki objects. However, not all wiki objects 'expand' within Wikidata. Expansion is measured by finding the wiki objects that also occur as subjects: if they do occur as a subject, they have 'expanded' within Wikidata.
Triples can also have non-wiki objects, and non-wiki objects cannot be expanded in Wikidata. Non-wiki objects do not start with the wikidata.org or wikiba.se prefixes. Non-wiki objects can be URIs or literals. The idea is that since these objects do not expand within Wikidata, they are leaves in the graph, and so they can be modelled as properties of the associated node in property graphs.
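The expansion check described above (a wiki object is expanded if it also occurs as a subject) can be sketched with a set lookup. This is a toy, in-memory version of what would be a join in Spark. The triples are invented, `Q999999999` stands in for a hypothetical deleted item, and the substring test for wiki domains is a simplifying assumption.

```python
from collections import Counter

WIKI_DOMAINS = ("wikidata.org", "wikiba.se")


def is_wiki_object(term):
    """Rough check: a URI term that mentions one of the wiki domains."""
    return term.startswith("<") and any(d in term for d in WIKI_DOMAINS)


def expansion_counts(triples):
    """Count triples by whether their object expands within the data."""
    subjects = {s for s, _p, _o in triples}
    counts = Counter()
    for _s, _p, o in triples:
        if not is_wiki_object(o):
            counts["non-wiki leaf"] += 1
        elif o in subjects:
            counts["expanded"] += 1
        else:
            counts["unexpanded wiki"] += 1
    return counts


WD = "http://www.wikidata.org/"
toy = [
    (f"<{WD}entity/Q42>", f"<{WD}prop/direct/P31>", f"<{WD}entity/Q5>"),
    (f"<{WD}entity/Q5>", "<http://www.w3.org/2000/01/rdf-schema#label>", '"human"@en'),
    # Hypothetical object with no triples of its own (e.g. a deleted item).
    (f"<{WD}entity/Q42>", f"<{WD}prop/direct/P800>", f"<{WD}entity/Q999999999>"),
    (f"<{WD}entity/statement/Q42-x>", "<http://wikiba.se/ontology#rank>",
     "<http://wikiba.se/ontology#NormalRank>"),
]
result = expansion_counts(toy)
print(result)
```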
The table below shows the distribution of the number of triples containing different types of objects. All percentages are percentages of the total number of triples.
Object type | # Triples | % Total triples | Object type | # Triples | % Total triples | Object type | # Triples | % Total triples |
---|---|---|---|---|---|---|---|---|
Wiki object | 6828025880 | 52.9 | Wikidata object | 4221092432 | 32.7 | Object also subject | 4220239722 | 32.7 |
Wiki object | 6828025880 | 52.9 | Wikidata object | 4221092432 | 32.7 | Object not subject | 852710 | 0.0066 |
Wiki object | 6828025880 | 52.9 | Wikiba.se object | 2606933448 | 20.2 | Object also subject | 0 | 0 |
Wiki object | 6828025880 | 52.9 | Wikiba.se object | 2606933448 | 20.2 | Object not subject | 2606933448 | 20.2 |
Non-wiki object | 5843911006 | 45.3 | URI object | 391262595 | 3 | | | |
Non-wiki object | 5843911006 | 45.3 | Literal object | 5452648411 | 42.2 | | | |
If the table is too much to digest, here is a simple diagram with the same information but in a more colorful and beautiful way.
Expanded vs Unexpanded objects
This section does some analysis mainly to verify that wikidata objects do expand within wikidata. This is a premise to the idea that non-wiki objects don't expand within wikidata and are leaves in the graph.
- Triples with wiki objects that also appear as subjects (therefore do expand in wikidata): 4220239722 (32.7%)
  - Number of triples with wikidata.org object that are also subject: 4220239722 (32.7%)
  - Number of triples with wikiba.se object that are also subject: 0
- Total wikidata-objects that are not expanded in wikidata (not subjects): 852710
- Distinct wikidata-objects that are not expanded in wikidata (not subjects): 852319 (Most are distinct)
Ideally, we expect that a Wikidata entity used as an object also appears as a subject with some relevant information in Wikidata. But 852710 triples have objects that do not have corresponding subjects. The Wikidata objects that do not expand in Wikidata, with the count of their occurrences as objects, are shown below. Upon manual inspection, the Q-items seem to be deleted entries, which can explain why they are not available as subjects in Wikidata. Nevertheless, the triples that use them as objects still persist.
object | count |
---|---|
wd:Q68637652 | 148 |
<http://www.wikidata.org/.well-known/genid/abf15f7aee46e00705c700147bd53518> | 32 |
wd:Q28968053 | 24 |
<http://www.wikidata.org/.well-known/genid/3d27111745b4e92877aa4c1ad765e5ca> | 15 |
wd:L229411-S1 | 14 |
wd:Q35104224 | 14 |
wd:Q58331113 | 12 |
wd:undefined | 10 |
wd:Q107006232 | 10 |
<http://www.wikidata.org/.well-known/genid/fa575516a51320c3beb70f7719f72a99> | 10 |
<http://www.wikidata.org/.well-known/genid/170fd20cd570feefeba7cbc7be44c853> | 9 |
Objects Per Item
While it is useful to know how many non-wiki objects there are and that they can be treated as properties in a property graph, we still do not know how spread out they are. Are most of the non-wiki objects concentrated in a few items, or do most items have a lot of non-wiki objects? Counting the number of non-wiki triples per item helps answer these questions.
Object type | max | min | avg | std |
---|---|---|---|---|
Non-wiki | 5311 | 1 | 5.59 | 19.86 |
Literal | 5309 | 1 | 5.32 | 19.94 |
Wiki | 16667 | 1 | 4.21 | 8.7 |
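The per-subject summary above (max, min, avg, std) can be sketched as follows. This is a toy illustration with invented triples; the `keep` predicate and function name are chosen here, and the use of the sample standard deviation is an assumption, since the source does not say which variant was used. Subjects with no matching object never enter the counter, which would be consistent with the minimum of 1 in the table.

```python
from collections import Counter
from statistics import mean, stdev


def per_subject_stats(triples, keep):
    """Summarise how many objects accepted by keep() each subject has."""
    per_subject = Counter(s for s, _p, o in triples if keep(o))
    values = list(per_subject.values())
    return {
        "max": max(values),
        "min": min(values),
        "avg": mean(values),
        "std": stdev(values),  # sample standard deviation (an assumption)
    }


def is_literal(term):
    """Literals are written with a leading double quote in this toy data."""
    return term.startswith('"')


# Invented triples: subject A has three literal objects, subject B has one.
toy = [
    ("wd:A", "rdfs:label", '"a"@en'),
    ("wd:A", "schema:description", '"first"@en'),
    ("wd:A", "wdt:P1476", '"title"@en'),
    ("wd:B", "rdfs:label", '"b"@en'),
    ("wd:B", "wdt:P31", "wd:A"),
]
stats = per_subject_stats(toy, is_literal)
print(stats)
```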
Subject (top non-wiki objects) | Count | Subject (top literals) | Count | Subject (top wiki objects) | Count |
---|---|---|---|---|---|
wd:Q56836084 | 5311 | wd:Q56836084 | 5309 | wd:Q39790431 | 16667 |
wd:Q106988069 | 5125 | wd:Q106988069 | 5123 | wd:Q27972199 | 6208 |
wd:Q64022985 | 4492 | wd:Q64022985 | 4491 | wd:Q57661806 | 6086 |
wd:Q56883844 | 4489 | wd:Q56883844 | 4489 | wd:Q21558717 | 6018 |
wd:Q57920219 | 4484 | wd:Q57920219 | 4482 | wd:Q56754739 | 5986 |
wd:Q21558717 | 4328 | wd:Q21558717 | 4327 | wd:Q56895655 | 5970 |
wd:Q56754739 | 4238 | wd:Q56754739 | 4236 | wd:Q63409374 | 5964 |
wd:Q56895655 | 4218 | wd:Q56895655 | 4216 | wd:Q64022985 | 5894 |
wd:Q174565 | 4049 | wd:Q174565 | 4043 | wd:Q56836084 | 5652 |
wd:Q58231267 | 3812 | wd:Q58231267 | 3810 | wd:Q106988069 | 5165 |
wd:Q467925 | 3277 | wd:Q467925 | 3272 | wd:Q58231267 | 4920 |
wd:Q104369389 | 2983 | wd:Q104369389 | 2982 | wd:Q56489295 | 4876 |
wd:Q100507117 | 2974 | wd:Q100507117 | 2973 | wd:Q33928881 | 4817 |
wd:Q104798012 | 2930 | wd:Q104798012 | 2929 | wd:Q28388335 | 4763 |
wd:Q98468691 | 2917 | wd:Q98468691 | 2916 | wd:Q57920219 | 4695 |
wd:Q98730204 | 2914 | wd:Q98730204 | 2913 | wd:Q56883844 | 4682 |
wd:Q96613392 | 2881 | wd:Q96613392 | 2880 | wd:Q57735077 | 4559 |
wd:Q104467608 | 2877 | wd:Q104467608 | 2876 | wd:Q35952737 | 4435 |
wd:Q21521425 | 2868 | wd:Q21521425 | 2867 | wd:Q35202929 | 4380 |
wd:Q21521423 | 2827 | wd:Q21521423 | 2826 | wd:Q30486707 | 4129 |