User:AKhatun/Wikidata Vertical Analysis
Wikidata is an open knowledge base in the form of a graph, accessible through SPARQL queries (among other things). The graph is formed of triples of the form Subject, Predicate, Object. These components connect to each other, forming a huge interconnected web of data. Wikidata is growing rapidly, and it is time to think about scaling. With this aim, this page presents some analysis of Wikidata to find out:
- Amount of certain kinds of vertical data like labels, descriptions, scientific articles etc
- How many queries ask for each of these vertical slices
- Analysis of these queries
Phabricator tickets: T282790, T291190.
TL;DR
If Blazegraph (Wikidata's backend) were to fail, what could we temporarily remove from Wikidata so that it can keep functioning? Some data points found across items in Wikidata, such as labels, descriptions and identifiers, are possible candidates. The analysis done on these vertical data is described in the following sections.
"Number of days for Wikidata to recover" is the estimated number of days for Wikidata to get back to its current size if some amount of triples is removed from Wikidata. To clarify: Descriptions form ~20% of Wikidata triples. If we were to remove them, then given the rate at which Wikidata is growing, it would take ~500 days for Wikidata to jump back to its current size, despite removing the descriptions. See more about Wikidata growth rate below.
Vertical Data Analysis
The Wikidata snapshot of 20210712 was used for this analysis.
Total triples
Before we begin, the total number of triples in this specific snapshot of Wikidata is 12671768950, approximately 12.67 billion. The growth rate of triples is not constant, but treating the growth shown in the Grafana dashboard as an approximately straight line, Wikidata grows at a rate of 4.77 million triples per day. This rate was calculated from the number of triples at the start and end of a 90-day interval (11/3/21 to 6/6/21). During this period Wikidata grew by 3.38%.
The following analyses assume a growth of 4.77M triples per day where applicable, so take the numbers as approximations only. To repeat, the Wikidata growth rate is not constant; the 4.77M per day figure is a rough approximation.
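The "Number of days for Wikidata to recover" columns used throughout this page follow from simple division by the assumed growth rate. A minimal sketch of that arithmetic, using the snapshot total and growth rate quoted above and the description triple count reported in the next section (the function names are hypothetical and only for illustration):

```python
# Figures quoted on this page.
TOTAL_TRIPLES = 12_671_768_950   # triples in the 20210712 snapshot
GROWTH_PER_DAY = 4_770_000       # assumed constant growth, triples per day


def days_to_recover(removed_triples, growth_per_day=GROWTH_PER_DAY):
    """Days needed to regrow `removed_triples` at a constant growth rate."""
    return removed_triples / growth_per_day


def share_of_total(triples, total=TOTAL_TRIPLES):
    """Percentage of all triples represented by `triples`."""
    return 100.0 * triples / total


description_triples = 2_471_378_661  # schema:description triples
print(round(share_of_total(description_triples), 1))  # 19.5
print(round(days_to_recover(description_triples)))    # 518
```

Because the real growth rate is not constant, these values are only as good as the straight-line assumption.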
Description
The number of triples with the predicate schema:description is 2471378661, 19.5% of all triples.
Description | Triple Count | Triple % | Number of days for Wikidata to recover |
---|---|---|---|
English | 72609016 | 0.57 | 15 |
Other Languages | 2398769645 | 18.93 | 502.8 |
Total | 2471378661 | 19.5 | 518 |
Additional Info
Some more information on descriptions.
Number of items that have a description | 87048501 |
---|---|
Average descriptions per item | 28.4 |
Maximum description count per item | 258 |
Number of items with one description | 9910091 (11% of items) |
Number of items with more than one description | 77138410 (88% of items) |
Number of items that have an English description | 72609016 |
Number of items that don't have an English description | 14439485 (16.6%) |
Therefore, 16.6% of all items that have a description don't have an English description. If we were to remove all non-English descriptions, 16.6% of the items that had a description would no longer have one.
Distribution of descriptions per item
Descriptions per Item | Count | Count % | Cumulative % |
---|---|---|---|
1 | 9910091 | 11.38 | 11.38 |
2 | 10845750 | 12.46 | 23.84 |
3 | 13939579 | 16.01 | 39.85 |
4 | 5221876 | 6.00 | 45.85 |
5 | 3180051 | 3.65 | 49.50 |
6 | 2061753 | 2.37 | 51.87 |
7 | 1456036 | 1.67 | 53.54 |
8 | 938750 | 1.08 | 54.62 |
9 | 918864 | 1.06 | 55.68 |
10 | 886663 | 1.02 | 56.70 |
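The "Count %" and "Cumulative %" columns can be derived from the raw counts and the number of items that have a description. A small sketch, assuming the counts are available as a mapping from "descriptions per item" to "number of items" (the function name and data layout are hypothetical):

```python
def distribution_table(items_by_count, total_items):
    """Return rows of (per_item, items, percent, cumulative_percent)."""
    rows = []
    running = 0
    for per_item in sorted(items_by_count):
        items = items_by_count[per_item]
        running += items
        rows.append((
            per_item,
            items,
            round(100.0 * items / total_items, 2),
            round(100.0 * running / total_items, 2),
        ))
    return rows


# First three rows of the table above; 87048501 items have a description.
counts = {1: 9_910_091, 2: 10_845_750, 3: 13_939_579}
for row in distribution_table(counts, 87_048_501):
    print(row)
```

Cumulative values computed this way can differ from the table in the last digit, since the table's cumulative column appears to add up the already rounded percentages.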
Language distribution of descriptions
There are 440 different language tags in descriptions. 50% of the descriptions are in 32 languages and 90% are in 94 languages.
Language tag | Description count | Description % |
---|---|---|
nl | 75405965 | 3.05 |
en | 72609016 | 2.94 |
de | 61716292 | 2.50 |
ar | 45939199 | 1.86 |
fr | 42861255 | 1.73 |
es | 39989399 | 1.62 |
uk | 39859846 | 1.61 |
ast | 38642801 | 1.56 |
ca | 36901411 | 1.49 |
bn | 36750936 | 1.49 |
Extra distribution figures in Jupyter Notebook # Description ## Distribution of language tags
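Statements such as "50% of the descriptions are in 32 languages" come from sorting language tags by frequency and counting how many are needed to reach a given share. A rough sketch of that calculation on made-up counts (the tags, numbers and function name below are illustrative assumptions, not data from this analysis):

```python
def tags_needed(counts_by_tag, fraction):
    """Smallest number of most-frequent tags covering `fraction` of all entries."""
    total = sum(counts_by_tag.values())
    covered = 0
    for n, count in enumerate(sorted(counts_by_tag.values(), reverse=True), start=1):
        covered += count
        if covered >= fraction * total:
            return n
    return len(counts_by_tag)


# Toy example with four hypothetical language tags.
toy_counts = {"aa": 50, "bb": 30, "cc": 15, "dd": 5}
print(tags_needed(toy_counts, 0.5))  # 1
print(tags_needed(toy_counts, 0.9))  # 3
```

The same function applies unchanged to label language tags or to external identifier properties, where the page reports similar coverage figures.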
Label
The number of triples with the predicate rdfs:label is 499663174, 3.9% of all triples.
Label | Triple Count | Triple % | Number of days for Wikidata to recover |
---|---|---|---|
English | 79778129 | 0.6 | 16 |
Other Languages | 419885045 | 3.3 | 88 |
Total | 499663174 | 3.9 | 104 |
Additional Info
Some more information on labels.
Number of items that have a label | 93474062 |
---|---|
Average labels per item | 5.34 |
Maximum label count per item | 446 |
Number of items with one label | 20084825 (21% of items) |
Number of items with more than one label | 73389237 (78% of items) |
Number of items that have an English label | 79778129 |
Number of items that don't have an English label | 13695933 (14.65%) |
Therefore, 14.7% of all items that have a label don't have an English label. If we were to remove all non-English labels, 14.7% of the items that had a label would no longer have one.
Distribution of labels per item
Labels per Item | Count | Count % | Cumulative % |
---|---|---|---|
1 | 20084825 | 21.49 | 21.49 |
2 | 41697507 | 44.61 | 66.10 |
3 | 10030895 | 10.73 | 76.83 |
4 | 4988361 | 5.34 | 82.17 |
5 | 2568068 | 2.75 | 84.92 |
6 | 1857891 | 1.99 | 86.91 |
7 | 1366863 | 1.46 | 88.37 |
8 | 1592480 | 1.70 | 90.07 |
9 | 683102 | 0.73 | 90.80 |
10 | 731273 | 0.78 | 91.58 |
Language distribution of labels
There are 476 different language tags in labels. 40% of the labels are in only 6 languages and 50% of the labels are in 12 languages.
Language tag | Label count | Label % |
---|---|---|
en | 79778129 | 15.97 |
nl | 56940665 | 11.40 |
ast | 16106324 | 3.22 |
fr | 14594937 | 2.92 |
de | 14352435 | 2.87 |
es | 13005130 | 2.60 |
ga | 9162180 | 1.83 |
it | 9090037 | 1.82 |
bn | 8531392 | 1.71 |
pt | 7966495 | 1.59 |
More distribution figures in Jupyter Notebook # Labels ## Distribution of language tags
Other predicates like Label
Other such predicates are skos:altLabel and schema:name. Note that there are no triples with the predicate skos:prefLabel.
Predicate | Triple Count | Triple % | Number of days for Wikidata to recover |
---|---|---|---|
schema:name | 78785768 | 0.62 | 16.5 |
skos:altLabel | 102593854 | 0.81 | 21.5 |
rdfs:label | 499663174 | 3.9 | 104 |
Label | Triple Count | Triple % | Number of days for Wikidata to recover |
---|---|---|---|
English schema:name | 13721324 | 0.11 | 3 |
Other Languages schema:name | 65064444 | 0.51 | 13.6 |
English skos:altLabel | 9157038 | 0.07 | 2 |
Other Languages skos:altLabel | 93436816 | 0.74 | 19.5 |
English rdfs:label | 79778129 | 0.6 | 16 |
Other Languages rdfs:label | 419885045 | 3.3 | 88 |
Total English | 102656491 | 0.8 | 21.5 |
Total Other Language | 578386305 | 4.56 | 121 |
Total | 681042796 | 5.37 | 143 |
More distributions in Jupyter Notebook # altLabels and Jupyter Notebook # schema:name
External Identifier
Identifiers are properties, like P297. They have wikibase:propertyType wikibase:ExternalId, i.e. they are of property type External ID. Example identifiers are UNBIS Thesaurus ID, BBK (library and bibliographic classification), Symptom Ontology ID, Bilibili bangumi ID etc. There are 6322 distinct external identifiers (as of 10 August, 2021).
These properties appear as /prop, meaning the object is a statement that holds more information, or as /prop/direct (and /prop/direct-normalized), meaning the object is a single URI or literal that holds no further information.
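To make the three forms concrete, the sketch below classifies a predicate IRI by its Wikidata RDF namespace. It assumes the standard namespaces `http://www.wikidata.org/prop/`, `.../prop/direct/` and `.../prop/direct-normalized/`; the function name is hypothetical, and this is not necessarily how the counts below were produced.

```python
import re

# Matches /prop/P123, /prop/direct/P123 and /prop/direct-normalized/P123 only.
# Other sub-namespaces such as /prop/statement/ are deliberately not matched.
_PREDICATE = re.compile(
    r"^http://www\.wikidata\.org/prop/(direct-normalized/|direct/)?(P\d+)$"
)


def classify_predicate(iri):
    """Return (form, property_id) for a property predicate IRI, else None."""
    match = _PREDICATE.match(iri)
    if match is None:
        return None
    form = ("/prop/" + (match.group(1) or "")).rstrip("/")
    return form, match.group(2)


print(classify_predicate("http://www.wikidata.org/prop/P356"))
print(classify_predicate("http://www.wikidata.org/prop/direct/P356"))
print(classify_predicate("http://www.wikidata.org/prop/direct-normalized/P214"))
```

Restricting the result to external identifiers would additionally require the list of properties whose wikibase:propertyType is wikibase:ExternalId, which is not reproduced here.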
Triple type | Triple Count | % Triple | Number of days for Wikidata to recover |
---|---|---|---|
external identifiers as /prop | 179679329 | 1.4 | 37 |
external identifiers as /prop/direct | 179486550 | 1.4 | 37 |
external identifiers as /prop/direct-normalized | 63666217 | 0.5 | 13 |
triples of /prop statement | 717745459 | 5.6 | 150 |
Total | 1140577555 | 9.0 | 239 |
*Note that triples that define the IDs themselves are not included here. Those are in the range of 0.009% of the entire dataset.
ID | ID label | Triple Count | % Triples with ID | % Cumulative |
---|---|---|---|---|
P356 | DOI | 81479716 | 19.27 | 19.27 |
P698 | PubMed ID | 63920010 | 15.12 | 34.39 |
P2671 | Google Knowledge Graph ID | 22127898 | 5.23 | 39.62 |
P3083 | SIMBAD ID | 16316711 | 3.86 | 43.48 |
P646 | Freebase ID | 13274336 | 3.14 | 46.62 |
P932 | PMCID | 12706776 | 3.01 | 49.63 |
P1566 | GeoNames ID | 11115096 | 2.63 | 52.26 |
P5875 | ResearchGate publication ID | 9157349 | 2.17 | 54.43 |
P214 | VIAF ID | 8108557 | 1.92 | 56.35 |
P496 | ORCID iD | 5222825 | 1.24 | 57.59 |
P846 | GBIF taxon ID | 4573391 | 1.08 | 58.67 |
P244 | Library of Congress authority ID | 3894577 | 0.92 | 59.59 |
P227 | GND ID | 3763072 | 0.89 | 60.48 |
P7859 | WorldCat Identities ID | 3667589 | 0.87 | 61.35 |
P6179 | Dimensions Publication ID | 3080555 | 0.73 | 62.08 |
P2326 | GNS Unique Feature ID | 2935976 | 0.69 | 62.77 |
P5055 | IRMNG ID | 2717119 | 0.64 | 63.41 |
P213 | ISNI | 2659310 | 0.63 | 64.04 |
P235 | InChIKey | 2531375 | 0.60 | 64.64 |
P234 | InChI | 2516244 | 0.60 | 65.24 |
Around 19% of the triples with external IDs are triples related to P356 (DOI), 15% to P698 (PubMed ID). 64 (out of 6322) IDs form 80% of the triples having external IDs, 209 form 90%. See more with figures in Jupyter Notebook # External Identifiers
Query Analysis
WDQS external queries of 08/2021 were used for this analysis. All the following numbers were calculated for monthly data.
Note that:
- Only the queries that contain a direct mention of the predicates were considered. Generic open-ended queries that happen to match the predicates were not considered here. For example, queries like ?sub ?pred ?obj or ?sub ?obj "label_string" were not counted, but queries like ?sub rdfs:label ?obj or ?sub rdfs:label "label_string" were taken into consideration.
- The query counts and percentages are not mutually exclusive across vertical slices. Queries that contain rdfs:label, for example, more often than not also contain skos:altLabel. Such queries increase counts for both categories.
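The two notes above can be illustrated with a rough text-matching sketch: a query counts towards a slice only if it names the predicate explicitly, and one query may count towards several slices. This is an illustration under simplifying assumptions (plain string matching on the query text, standard prefix IRIs, hypothetical function names), not the method actually used for the figures below. It also ignores the wikibase:label service.

```python
import re
from collections import Counter

# Prefixed name and full IRI for each vertical slice.
SLICES = {
    "description": ("schema:description", "http://schema.org/description"),
    "label": ("rdfs:label", "http://www.w3.org/2000/01/rdf-schema#label"),
    "altLabel": ("skos:altLabel", "http://www.w3.org/2004/02/skos/core#altLabel"),
    "name": ("schema:name", "http://schema.org/name"),
}

# A prefixed name must not be followed by further name characters,
# so "schema:name" does not match a longer local name.
PATTERNS = {
    name: re.compile(re.escape(prefixed) + r"(?![\w-])|" + re.escape("<" + iri + ">"))
    for name, (prefixed, iri) in SLICES.items()
}


def slices_mentioned(query):
    """Set of slices whose predicate is named explicitly in the query text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(query)}


def count_slices(queries):
    """Non-exclusive counts: a query adds one to every slice it mentions."""
    counts = Counter()
    for query in queries:
        counts.update(slices_mentioned(query))
    return counts


queries = [
    "SELECT ?s WHERE { ?s rdfs:label ?o ; skos:altLabel ?a }",
    "SELECT * WHERE { ?sub ?pred ?obj }",  # open-ended, not counted
    'SELECT ?s WHERE { ?s <http://schema.org/description> "x"@en }',
]
print(count_slices(queries))
```

A production version would more likely parse the SPARQL rather than match text, since plain matching also hits predicates that appear only inside comments or string literals.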
Total Queries
- Total number of monthly queries: ~190M
- Total monthly query execution time: ~14,000 hours
Description
- Number of queries where schema:description occurs anywhere in the query (predicate/object/VALUES etc): 21,863,454
- Number of queries where schema:description is the predicate (subset of the former): 21,862,863
- Number of queries where schema:description is part of a more complex path: 4
- Total number of queries with schema:description: 21,863,454, which is 12% of the monthly queries.
- Queries with descriptions make up 2,600 hours or 18.65% of monthly query time.
User agent | Number of queries | % of description queries | % of all queries |
---|---|---|---|
searx/1.0.0 | 5,000,782 | 22 | 2.7 |
UA-X | 3,532,030 | 16 | 1.9 |
Python-urllib/3.6 | 3,233,151 | 14.7 | 1.77 |
searx/0.18.0 | 2,313,343 | 10 | 1.27 |
searx/1.0.0-unknown | 952,243 | 4.3 | 0.52 |
User agent | Query time (hr) | % time of description queries | % time of all queries |
---|---|---|---|
searx/1.0.0 | 638 | 24.5 | 4.5 |
Python-urllib/3.6 | 335 | 12.8 | 2.4 |
searx/0.18.0 | 256 | 9.8 | 1.8 |
UA-X | 204 | 7.8 | 1.4 |
UA-X | 156 | 6.0 | 1.1 |
Label
- Number of queries where rdfs:label occurs anywhere in the query (predicate/object/VALUES etc): 42,883,256
- Number of queries where rdfs:label is the predicate (subset of the former): 40,532,779
- Number of queries where wikibase:label service is used to access labels: 72,936,044
- Number of queries where rdfs:label is part of a more complex path: 2,537,238
- Total number of queries with rdfs:label: 88,861,469, which is 48.8% of the monthly queries.
- Queries with labels make up 10,000 hours or 72% of monthly query time.
User agent | Number of queries | % of label queries | % of all queries |
---|---|---|---|
UA-X | 7,817,612 | 8.8 | 4.3 |
wikidataintegrator/0.8.4 | 6,999,877 | 7.9 | 3.8 |
NERBot/0.0 | 6,118,205 | 6.9 | 3.36 |
searx/1.0.0 | 5,097,913 | 5.7 | 2.8 |
UA-X | 3,532,030 | 3.9 | 1.9 |
Pywikibot/6.1.0 | 3,347,977 | 3.7 | 1.8 |
Python-urllib/3.6 | 3,233,689 | 3.6 | 1.77 |
UA-X | 2,947,910 | 3.3 | 1.62 |
UA-X | 2,502,071 | 2.8 | 1.37 |
searx/0.18.0 | 2,346,649 | 2.6 | 1.29 |
WikidataQueryServiceR | 2,131,186 | 2.4 | 1.17 |
User agent | Query time (hr) | % time of label queries | % time of all queries |
---|---|---|---|
UA-X | 1679 | 16.63 | 12.04 |
UA-X | 831 | 8.22 | 5.95 |
searx/1.0.0 | 646 | 6.4 | 4.63 |
UA-X | 471 | 4.67 | 3.38 |
Python-urllib/3.6 | 335 | 3.32 | 2.4 |
NERBot/0.0 | 291 | 2.88 | 2.08 |
UA-X | 285 | 2.82 | 2.04 |
searx/0.18.0 | 259 | 2.56 | 1.85 |
UA-X | 200 | 1.99 | 1.44 |
UA-X | 156 | 1.54 | 1.11 |
searx/1.0.0-unknown | 110 | 1.09 | 0.79 |
altLabel
- Number of queries where skos:altLabel occurs anywhere in the query (predicate/object/VALUES etc): 29,325,216
- Number of queries where skos:altLabel is the predicate (subset of the former): 25,928,709
- Number of queries where skos:altLabel is part of a more complex path: 2,470,100
- Total number of queries with skos:altLabel: 29,325,216, which is 16% of the monthly queries.
- Queries with altLabels make up 800 hours or 5% of monthly query time.
User agent | Number of queries | % of altLabel queries | % of all queries |
---|---|---|---|
Toolforge - mix-n-match | 20,215,093 | 68 | 11 |
Python-urllib/3.6 | 3,233,151 | 11 | 1.77 |
UA-X | 919,678 | 3 | 0.5 |
UA-X | 655,385 | 2.2 | 0.36 |
UA-X | 388,507 | 1.3 | 0.2 |
User agent | Query time (hr) | % time of altLabel queries | % time of all queries |
---|---|---|---|
Python-urllib/3.6 | 335 | 40.5 | 2.4 |
Toolforge - mix-n-match | 141 | 17.1 | 1 |
UA-X | 75 | 9.1 | 0.54 |
UA-X | 31 | 3.8 | 0.22 |
UA-X | 30 | 3.6 | 0.21 |
Name
- Number of queries where schema:name occurs anywhere in the query (predicate/object/VALUES etc): 14,965,990
- Number of queries where schema:name is the predicate (subset of the former): 14,964,300
- Number of queries where schema:name is part of a more complex path: 1,327
- Total number of queries with schema:name: 14,965,990, which is 8% of the monthly queries.
- Queries with schema:name make up 1,800 hours or 13% of monthly query time.
User agent | Number of queries | % of schema:name queries | % of all queries |
---|---|---|---|
searx/1.0.0 | 5,001,390 | 33 | 2.7 |
searx/0.18.0 | 2,313,343 | 15 | 1.27 |
searx/1.0.0-unknown | 952,243 | 6 | 0.52 |
searx/1.0.0-200-313a9847 | 513,220 | 3.4 | 0.28 |
WikidataIdTool/1.0 | 507,367 | 3.4 | 0.27 |
User agent | Query time (hr) | % time of schema:name queries | % time of all queries |
---|---|---|---|
searx/1.0.0 | 638 | 34.9 | 4.5 |
searx/0.18.0 | 256 | 14 | 1.8 |
searx/1.0.0-unknown | 110 | 6 | 0.8 |
searx/1.0.0-200-313a9847 | 66 | 3.6 | 0.5 |
searx/1.0.0-211-968b2899 | 51 | 2.8 | 0.3 |
External Identifiers
External Identifiers are those that have the wikibase:propertyType of wikibase:ExternalId. There are ~6500 such properties (as of Sep, 2021). IDs with the top usage in queries, occurring anywhere in the query from predicates and objects to VALUES tables etc, are shown below. Again, the counts are not mutually exclusive since the same query can contain several of these properties.
Id P value | Id name | Query count | % of all queries |
---|---|---|---|
P345 | IMDb ID | 16304253 | 8.967 |
P2013 | Facebook ID | 14233516 | 7.828 |
P2002 | Twitter username | 14040155 | 7.722 |
P212 | ISBN-13 | 13646249 | 7.505 |
P2003 | Instagram username | 13518766 | 7.435 |
P218 | ISO 639-1 code | 13512920 | 7.432 |
P957 | ISBN-10 | 13455799 | 7.400 |
P498 | ISO 4217 code | 13422212 | 7.382 |
P2397 | YouTube channel ID | 13389689 | 7.364 |
P434 | MusicBrainz artist ID | 13313017 | 7.322 |
P1651 | YouTube video ID | 13291716 | 7.310 |
P436 | MusicBrainz release group ID | 13288620 | 7.309 |
P435 | MusicBrainz work ID | 13259330 | 7.292 |
P966 | MusicBrainz label ID | 13254823 | 7.290 |
P846 | GBIF taxon ID | 6650333 | 3.658 |
P300 | ISO 3166-2 code | 6594479 | 3.627 |
P691 | NKCR AUT ID | 3992781 | 2.196 |
P214 | VIAF ID | 1920419 | 1.056 |
P698 | PubMed ID | 1843036 | 1.014 |
P2949 | WikiTree person ID | 1698142 | 0.934 |
- Total number of queries with External Ids: 55,127,216, which is 30% of the monthly queries.
- Queries with external ids make up 5,500 hours or 39% of monthly query time.
User agent | Number of queries | % of external Id queries | % of all queries |
---|---|---|---|
wikidataintegrator/0.8.4 | 6993803 | 12 | 3.8 |
Rust mediawiki API/0.2.7 | 6603929 | 11 | 3.6 |
Hub | 6231886 | 11 | 3.4 |
searx/1.0.0 | 5097329 | 9 | 2.8 |
Googlebot/2.1 | 2614683 | 4 | 1.43 |
User agent | Query time (hr) | % time of external Id queries | % time of all queries |
---|---|---|---|
UA-X | 1300 | 23.8 | 9.3 |
searx/1.0.0 | 646 | 11.8 | 4.6 |
searx/0.18.0 | 259 | 4.7 | 1.8 |
Needle/0.9.2 | 205 | 3.7 | 1.4 |
UA-X | 204 | 3.7 | 1.4 |