User:AKhatun/WDQS Triples Analysis
The following analysis covers the SPARQL queries (public cluster) from the Wikidata Query Service (WDQS). The data is from 10 May 2021. It is part of a broader analysis of WDQS intended to help scale the service. This page analyses the triples extracted from the SPARQL queries. Earlier analyses looked at the SPARQL queries without explicitly extracting triples from them (see User:Joal/WDQS_Queries_Analysis and User:Joal/WDQS_Traffic_Analysis).
Phabricator Task: T282129
Jupyter Notebook of Analysis
Structure of a SPARQL query
A triple consists of a Subject, a Predicate, and an Object. Each of them is called a Node. A node can be one of several types: Variable, URI, Literal, Path, or Blank Node.
The SPARQL queries taken from WDQS are first processed to extract the triples, among other things. Not all queries are processed successfully, owing to the presence of uncommon prefixes (mostly indicating that the query uses something other than WDQS in the backend, such as mwapi). Around 92% of queries are parsed and processed correctly.
A sample SPARQL query is shown below, along with the triples extracted from it.
The Query:
SELECT * WHERE { ?s wdt:P31/wdt:P279 <_:bn>; skos:altLabel "alias"@en. }
The extracted triples:
[
  TripleInfo(
    NodeInfo(NODE_VAR,s), //subject
    NodeInfo(PATH,wdt:P31/wdt:P279), //predicate
    NodeInfo(NODE_BLANK,bn) //object
  ),
  TripleInfo(
    NodeInfo(NODE_VAR,s), //subject
    NodeInfo(NODE_URI,skos:altLabel), //predicate
    NodeInfo(NODE_LITERAL,alias@en) //object
  )
]
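The extracted structure can be mirrored in a few lines of Python for experimentation. This is only a sketch: the field names `node_type`, `value`, `subject`, `predicate`, and `obj` are assumptions made here for illustration, and the notebook's own `TripleInfo` and `NodeInfo` definitions may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeInfo:
    # One of NODE_VAR, NODE_URI, NODE_LITERAL, PATH, NODE_BLANK
    node_type: str
    # Variable name, URI, literal, path expression, or blank node label
    value: str


@dataclass(frozen=True)
class TripleInfo:
    subject: NodeInfo
    predicate: NodeInfo
    obj: NodeInfo  # named "obj" here to avoid shadowing Python's built-in "object"


# The two triples extracted from the sample query above
triples = [
    TripleInfo(
        NodeInfo("NODE_VAR", "s"),
        NodeInfo("PATH", "wdt:P31/wdt:P279"),
        NodeInfo("NODE_BLANK", "bn"),
    ),
    TripleInfo(
        NodeInfo("NODE_VAR", "s"),
        NodeInfo("NODE_URI", "skos:altLabel"),
        NodeInfo("NODE_LITERAL", "alias@en"),
    ),
]

for t in triples:
    print(t.subject.node_type, t.predicate.node_type, t.obj.node_type)
```

Frozen dataclasses are hashable, so these records can be used directly as keys when counting.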
Node Analysis
Combining the nodes from all subjects, predicates, and objects, the count of each node type is shown below.
NodeType | Count | Count % |
---|---|---|
NODE_URI | 48256552 | 53.37 |
NODE_VAR | 29871236 | 33.04 |
NODE_LITERAL | 11110653 | 12.29 |
PATH | 1172584 | 1.29 |
NODE_BLANK | 2 | 0.00 |
However, the node type distribution often differs by position in the triple. For example, subjects are usually URIs and objects are usually variables or literals. The node type is therefore also counted separately for subjects, predicates, and objects. Percentages are computed column-wise, that is, within each of the subject, predicate, and object columns.
NodeType | Subject count | Subject % | Predicate count | Predicate % | Object count | Object % |
---|---|---|---|---|---|---|
NODE_URI | 16104919 | 53.43 | 27011280 | 89.63 | 5140353 | 17.06 |
NODE_VAR | 14032060 | 46.56 | 1953145 | 6.48 | 13886031 | 46.07 |
NODE_LITERAL | 30 | 0.00 | 0 | 0.00 | 11110623 | 36.87 |
PATH | 0 | 0.00 | 1172584 | 3.89 | 0 | 0.00 |
NODE_BLANK | 0 | 0.00 | 0 | 0.00 | 2 | 0.00 |
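The per-position counts and column-wise percentages can be sketched as below. This assumes each triple is available as three `(node_type, value)` pairs, which is a simplification chosen here and not necessarily the notebook's representation; the two sample triples are the ones from the example query.

```python
from collections import Counter

POSITIONS = ("subject", "predicate", "object")

# Each triple is three (node_type, value) pairs: subject, predicate, object
triples = [
    (("NODE_VAR", "s"), ("PATH", "wdt:P31/wdt:P279"), ("NODE_BLANK", "bn")),
    (("NODE_VAR", "s"), ("NODE_URI", "skos:altLabel"), ("NODE_LITERAL", "alias@en")),
]


def node_type_counts(triples):
    """Count node types separately for subject, predicate, and object."""
    counts = {position: Counter() for position in POSITIONS}
    for triple in triples:
        for position, (node_type, _value) in zip(POSITIONS, triple):
            counts[position][node_type] += 1
    return counts


def column_percentages(counter):
    """Percentages within one column (one position), rounded to 2 places."""
    total = sum(counter.values())
    if total == 0:
        return {}
    return {node_type: round(100 * n / total, 2) for node_type, n in counter.items()}


counts = node_type_counts(triples)
# Summing the three columns gives the combined node type counts
overall = sum(counts.values(), Counter())

for position in POSITIONS:
    print(position, column_percentages(counts[position]))
```

Keeping one `Counter` per position makes the combined table a simple sum of the three columns.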
Node Values Distribution
The value of a node is the variable name for a Variable node, the URI for a URI node, the path for a Path node, the literal value for a Literal node, and so on. Node value analysis can be done in a variety of ways:
- Top Values for Sub/Pred/Obj nodes
- Top Values for each type of node (URI, literal, Var)
- Top values of each type of node, for each of Sub/Pred/Obj. Subjects tend to have more URIs, Objects tend to have more variables etc.
These analyses are shown in brief in the Jupyter Notebook.
These are based only on the data of 10th May, 2021.
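One way to produce the third kind of breakdown listed above (top values per node type, for each of Sub/Pred/Obj) is to group values by position and node type before counting. The sketch below is illustrative only: the function name `top_values` and the sample triples are made up here and are not taken from the notebook.

```python
from collections import Counter, defaultdict

POSITIONS = ("subject", "predicate", "object")


def top_values(triples, n=3):
    """Return the n most common values for each (position, node_type) pair."""
    grouped = defaultdict(Counter)
    for triple in triples:
        for position, (node_type, value) in zip(POSITIONS, triple):
            grouped[(position, node_type)][value] += 1
    return {key: counter.most_common(n) for key, counter in grouped.items()}


# Hypothetical sample triples as (node_type, value) pairs
triples = [
    (("NODE_VAR", "item"), ("NODE_URI", "wdt:P31"), ("NODE_URI", "wd:Q5")),
    (("NODE_VAR", "item"), ("NODE_URI", "rdfs:label"), ("NODE_VAR", "label")),
    (("NODE_VAR", "person"), ("NODE_URI", "wdt:P31"), ("NODE_URI", "wd:Q5")),
]

top = top_values(triples)
for key, values in top.items():
    print(key, values)
```

Dropping either element of the grouping key gives the other two breakdowns: top values per position, or top values per node type.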
Triple Analysis
Triples distribution based on node types
A triple is made of a Subject, a Predicate, and an Object. Each node is, as mentioned, classified into a type such as Variable or URI. Replacing each node in a triple with its type gives a triple of the form (NODE_URI, PATH, NODE_LITERAL), for example. The distribution of triples in this form gives insight into which kinds of triples are used most. Using more URIs or Literals means the person writing the SPARQL knows exactly what to ask for, whereas more Variables mean searching a larger portion of the graph. Of course, this also depends on the size of the subgraphs in question, so more analysis is needed to get deeper information from this.
Triple | Count | Count % |
---|---|---|
NODE_URI NODE_URI NODE_LITERAL | 8268924 | 27.44 |
NODE_VAR NODE_URI NODE_VAR | 7443036 | 24.70 |
NODE_URI NODE_URI NODE_VAR | 4997654 | 16.58 |
NODE_URI NODE_URI NODE_URI | 2508824 | 8.32 |
NODE_VAR NODE_URI NODE_LITERAL | 2113018 | 7.01 |
NODE_VAR NODE_URI NODE_URI | 1679795 | 5.57 |
NODE_VAR NODE_VAR NODE_VAR | 943751 | 3.13 |
NODE_VAR PATH NODE_URI | 862532 | 2.86 |
NODE_VAR NODE_VAR NODE_LITERAL | 721584 | 2.39 |
NODE_VAR PATH NODE_VAR | 216456 | 0.72 |
NODE_URI NODE_VAR NODE_VAR | 204853 | 0.68 |
NODE_URI PATH NODE_VAR | 80251 | 0.27 |
NODE_VAR NODE_VAR NODE_URI | 44789 | 0.15 |
NODE_URI NODE_VAR NODE_URI | 38165 | 0.13 |
NODE_VAR PATH NODE_LITERAL | 7097 | 0.02 |
NODE_URI PATH NODE_URI | 6248 | 0.02 |
NODE_LITERAL NODE_URI NODE_VAR | 29 | 0.00 |
NODE_VAR NODE_VAR NODE_BLANK | 2 | 0.00 |
NODE_LITERAL NODE_VAR NODE_VAR | 1 | 0.00 |
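As a consistency check, the triple-shape counts above can be summed by position and compared with the per-position node type table in the Node Analysis section. The sketch below copies the counts from the table; the helper name `marginal` is introduced here for illustration.

```python
from collections import Counter

# Counts copied from the table above: (subject, predicate, object) -> count
SHAPE_COUNTS = {
    ("NODE_URI", "NODE_URI", "NODE_LITERAL"): 8268924,
    ("NODE_VAR", "NODE_URI", "NODE_VAR"): 7443036,
    ("NODE_URI", "NODE_URI", "NODE_VAR"): 4997654,
    ("NODE_URI", "NODE_URI", "NODE_URI"): 2508824,
    ("NODE_VAR", "NODE_URI", "NODE_LITERAL"): 2113018,
    ("NODE_VAR", "NODE_URI", "NODE_URI"): 1679795,
    ("NODE_VAR", "NODE_VAR", "NODE_VAR"): 943751,
    ("NODE_VAR", "PATH", "NODE_URI"): 862532,
    ("NODE_VAR", "NODE_VAR", "NODE_LITERAL"): 721584,
    ("NODE_VAR", "PATH", "NODE_VAR"): 216456,
    ("NODE_URI", "NODE_VAR", "NODE_VAR"): 204853,
    ("NODE_URI", "PATH", "NODE_VAR"): 80251,
    ("NODE_VAR", "NODE_VAR", "NODE_URI"): 44789,
    ("NODE_URI", "NODE_VAR", "NODE_URI"): 38165,
    ("NODE_VAR", "PATH", "NODE_LITERAL"): 7097,
    ("NODE_URI", "PATH", "NODE_URI"): 6248,
    ("NODE_LITERAL", "NODE_URI", "NODE_VAR"): 29,
    ("NODE_VAR", "NODE_VAR", "NODE_BLANK"): 2,
    ("NODE_LITERAL", "NODE_VAR", "NODE_VAR"): 1,
}


def marginal(shape_counts, position):
    """Sum shape counts by the node type at one position (0=subject, 1=predicate, 2=object)."""
    totals = Counter()
    for shape, count in shape_counts.items():
        totals[shape[position]] += count
    return totals


subject_totals = marginal(SHAPE_COUNTS, 0)
predicate_totals = marginal(SHAPE_COUNTS, 1)
object_totals = marginal(SHAPE_COUNTS, 2)
total_triples = sum(SHAPE_COUNTS.values())

print(total_triples)
print(dict(subject_totals))
```

The subject totals come out as 16104919 URIs, 14032060 variables, and 30 literals, which agrees with the Subject column of the earlier table, so the two tables describe the same set of triples.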
Triples distribution based on node values
While the node type distribution gives some information, it does not show which parts of the Wikidata graph people query most, or which services they use most. URIs, Paths, and Literals carry this information. Variables and blank nodes are better left obfuscated, since the same SPARQL query can be written with any variable names. The distribution of triples with variable and blank nodes obfuscated is therefore shown below.
Triple | count |
---|---|
bd:serviceParam wikibase:language en | 3717731 |
NODE_VAR rdfs:label NODE_VAR | 1390180 |
NODE_VAR wdt:P279 NODE_VAR | 1245462 |
gas:program gas:out1 NODE_VAR | 1242919 |
gas:program gas:out NODE_VAR | 1242919 |
gas:program gas:traversalDirection Forward | 1242594 |
gas:program gas:gasClass com.bigdata.rdf.graph.analytics.SSSP | 1242387 |
gas:program gas:linkType wdt:P279 | 1242332 |
gas:program gas:maxIterations 3^^http://www.w3.org/2001/XMLSchema#integer | 1242307 |
NODE_VAR NODE_VAR NODE_VAR | 943751 |
NODE_VAR <http://www.wikidata.org/prop/direct/P31>/(<http://www.wikidata.org/prop/direct/P279>)* wd:Q16521 | 677313 |
NODE_VAR schema:about NODE_VAR | 584918 |
bd:serviceParam wikibase:language [AUTO_LANGUAGE],en | 555352 |
NODE_VAR schema:isPartOf https://en.wikipedia.org/ | 312901 |
NODE_VAR wdt:P569 NODE_VAR | 289123 |
NODE_VAR wdt:P570 NODE_VAR | 283221 |
NODE_VAR wdt:P1630 NODE_VAR | 251418 |
NODE_VAR wikibase:propertyType NODE_VAR | 248225 |
NODE_VAR schema:name NODE_VAR | 210927 |
NODE_VAR wdt:P31 NODE_VAR | 207363 |
NODE_VAR wdt:P18 NODE_VAR | 150968 |
NODE_VAR pq:P6552 NODE_VAR | 136415 |
NODE_VAR p:P2002 NODE_VAR | 136376 |
NODE_VAR rdf:type wikibase:Property | 120507 |
NODE_VAR wikibase:claim NODE_VAR | 82803 |
NODE_VAR wdt:P856 NODE_VAR | 79010 |
NODE_VAR wikibase:statementProperty NODE_VAR | 78692 |
hint:Query hint:optimizer None | 68602 |
NODE_VAR schema:inLanguage en | 65903 |
NODE_VAR skos:altLabel NODE_VAR | 64923 |
NODE_VAR wdt:P577 NODE_VAR | 61687 |
NODE_VAR pq:P1545 NODE_VAR | 55542 |
NODE_VAR schema:isPartOf https://sv.wikipedia.org/ | 55440 |
NODE_VAR wdt:P282 wd:Q8229 | 50172 |
http://www.wikidata.org schema:dateModified NODE_VAR | 49241 |
NODE_VAR wdt:P21 NODE_VAR | 48106 |
NODE_VAR wdt:P50 NODE_VAR | 46583 |
NODE_VAR schema:description NODE_VAR | 46269 |
NODE_VAR wikibase:propertyType wikibase:ExternalId | 44854 |
NODE_VAR wdt:P31 wd:Q5 | 43986 |
NODE_VAR p:P179 NODE_VAR | 42521 |
NODE_VAR wdt:P300 NODE_VAR | 42304 |
bd:serviceParam wikibase:language fr,en,it,sp,de | 41919 |
NODE_VAR wdt:P227 NODE_VAR | 39279 |
NODE_VAR wdt:P136 NODE_VAR | 39038 |
NODE_VAR wdt:P27 NODE_VAR | 38119 |
NODE_VAR ps:P179 NODE_VAR | 36276 |
NODE_VAR wdt:P19 NODE_VAR | 33985 |
NODE_VAR wdt:P1843 NODE_VAR | 33459 |
NODE_VAR wdt:P106 NODE_VAR | 32173 |
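The obfuscation step behind this table can be sketched as follows: variable and blank nodes are replaced by their type, every other node keeps its value, and the resulting strings are counted. The function names and the sample triples below are hypothetical and chosen only to illustrate the idea.

```python
from collections import Counter

# Node types whose values are replaced by the type name
OBFUSCATED_TYPES = {"NODE_VAR", "NODE_BLANK"}


def obfuscate(node):
    """Keep the value of URIs, paths, and literals; hide variable and blank node names."""
    node_type, value = node
    return node_type if node_type in OBFUSCATED_TYPES else value


def triple_key(triple):
    """Render a triple as 'subject predicate object' with obfuscated nodes."""
    return " ".join(obfuscate(node) for node in triple)


# Hypothetical sample triples as (node_type, value) pairs
triples = [
    (("NODE_VAR", "item"), ("NODE_URI", "wdt:P31"), ("NODE_URI", "wd:Q5")),
    (("NODE_VAR", "person"), ("NODE_URI", "wdt:P31"), ("NODE_URI", "wd:Q5")),
    (("NODE_VAR", "s"), ("NODE_URI", "rdfs:label"), ("NODE_VAR", "label")),
]

counts = Counter(triple_key(t) for t in triples)
for key, n in counts.most_common():
    print(key, n)
```

The first two triples use different variable names but collapse to the same key, `NODE_VAR wdt:P31 wd:Q5`, which is the reason for obfuscating variables before counting.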