User:AKhatun/Wikidata Subgraph Analysis

From Wikitech

TL;DR

  • All items that are instance of X, together with the statements on those items, are considered the subgraph of X. Some obvious subgraphs were merged into one.
  • 0.5% (~400) of the subgraphs make up 90% of Wikidata (both triples and items). Subgraph size visualization: subgraph stats
  • List of top 50 subgraphs: #Table_of_top_50_subgraph_information
  • The most common top predicates in the subgraphs are description, rdf:type, wikibase:rank, and references.
  • The growth rate of all subgraphs looks consistent and stable. None of them is growing at an unprecedented rate, but all are growing nonetheless. Growth rate visualization: subgraph_growth_rate
  • A visualization to show the connectivity between the top 50 subgraphs and how strong the connections are: wikidata_graph
    • Almost all subgraphs are connected to most other subgraphs in one way or other
    • Most connections are through triples
    • Subgraphs that share items are considered more connected (therefore, heavily weighted edges)
    • No isolated subgraph group or cluster was found in the top 50 subgraphs
    • The larger the subgraph gets, the more connections to other subgraphs it has
  • A list of the subgraph pairs that have items in common: nonzero_subgraph_pair_detail.csv. While most of the pairs seem reasonable, such as mountain-hill, some pairs seem odd, like village-human. This list can be used as a guide to fixing some of the errors in Wikidata.

What are subgraphs?

Wikidata contains data from many areas of knowledge. All of this data is highly interconnected, but some patterns can be found. We identify subgraphs within Wikidata and examine how large they are, how connected they are, and finally how much they are used (queried).

In order to find subgraphs, the following steps were taken:

  • Consider all items that are instance of (P31) the same item to be under a subgraph. For example: all items that are instance of Q13442814 are part of one subgraph.
  • Some subgraphs were merged where it was obvious. For example: all subclasses of astronomical object were considered part of astronomical object, as they were all indeed some sort of astronomical object. This method of subclass merging is not applicable everywhere without manual inspection.
  • Some large subgraphs were almost completely part of another subgraph. For example: all items under review article are also instance of scholarly article. In such cases, the contained subgraph (here, review article) was not considered a separate subgraph.
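
The three steps above can be sketched in Python as follows. This is a minimal illustration on toy data, not the pipeline actually used for the analysis: the item names, the manual merge map, the function names, and the 0.9 containment threshold are all assumptions made for the example.

```python
from collections import defaultdict

# Toy (item, class) pairs standing in for P31 statements. All names are illustrative.
STATEMENTS = [
    ("article1", "scholarly article"),
    ("article2", "scholarly article"),
    ("review1", "review article"),
    ("review1", "scholarly article"),
    ("star1", "star"),
    ("galaxy1", "galaxy"),
]

# Step 2: manually curated merges (assumed; the source stresses manual inspection).
MERGE_MAP = {"star": "astronomical object", "galaxy": "astronomical object"}


def build_subgraphs(statements, merge_map):
    """Step 1 and 2: group items by their P31 class, applying the manual merges."""
    subgraphs = defaultdict(set)
    for item, cls in statements:
        subgraphs[merge_map.get(cls, cls)].add(item)
    return dict(subgraphs)


def drop_contained(subgraphs, threshold=0.9):
    """Step 3: drop a subgraph if a larger one contains at least `threshold` of its items."""
    kept = {}
    for name, items in subgraphs.items():
        contained = any(
            other != name
            and len(items) < len(other_items)
            and len(items & other_items) / len(items) >= threshold
            for other, other_items in subgraphs.items()
        )
        if not contained:
            kept[name] = items
    return kept


subgraphs = drop_contained(build_subgraphs(STATEMENTS, MERGE_MAP))
```

On the toy data, review article is absorbed into scholarly article, and star and galaxy are merged into astronomical object.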

Subgraph sizes

Using only instance of, Wikidata has 82,919 subgraphs. The distribution of the sizes of these subgraphs has a clear long tail, with very few subgraphs incorporating most items in Wikidata. Subgraph size can be calculated in two ways:

  • The number of items it contains
  • The number of triples related to the items in a subgraph. This is what we refer to as subgraph size from here on.

Takeaways:

  • Most calculations from here on will take the top 50 subgraphs, which form 85% of Wikidata
  • 340 top subgraphs (0.5% of all subgraphs, after merging some) form 90% of Wikidata (91% of all items and 90% of all triples). These subgraphs have >=10,000 items each.
  • The remaining 99.5% of the subgraphs have <10,000 items each, and together form 10% of Wikidata.
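
A coverage figure such as "the top N subgraphs form 90% of Wikidata" can be computed by sorting subgraphs by size and accumulating until the target share is reached. The sketch below is a hypothetical helper on toy numbers; it assumes subgraph sizes can simply be summed, which slightly overcounts when an item belongs to more than one subgraph.

```python
def subgraphs_needed(sizes, total, target=0.9):
    """Return how many of the largest subgraphs are needed to cover `target` of `total`.

    Returns None if even all subgraphs together do not reach the target.
    """
    covered = 0
    for n, size in enumerate(sorted(sizes, reverse=True), start=1):
        covered += size
        if covered / total >= target:
            return n
    return None


# Toy sizes (not real Wikidata numbers): a few large subgraphs and a long tail.
toy_sizes = [50, 30, 10, 5, 3, 2]
needed = subgraphs_needed(toy_sizes, total=100, target=0.85)
```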

Below is the distribution of the number of items per subgraph.

To be more specific,

Subgraph item distribution (each row reads: there are N subgraphs with more than M items)
Number of subgraphs Number of items
54,602 1
23,724 10
6,625 100
1,712 1,000
392 10,000
63 100,000
10 1,000,000
1 10,000,000
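
A table of this shape can be produced by counting, for each threshold, the subgraphs whose item count exceeds it. A minimal sketch with made-up sizes; the function name and data are hypothetical:

```python
def count_above(item_counts, thresholds):
    """For each threshold, count the subgraphs with strictly more items than it."""
    return {t: sum(1 for c in item_counts if c > t) for t in thresholds}


# Toy item counts per subgraph (not real Wikidata numbers).
toy_counts = [1, 1, 5, 12, 150, 2_000, 30_000]
distribution = count_above(toy_counts, [1, 10, 100, 1_000, 10_000])
```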

Below is the subgraph size comparison of top 340 subgraphs in Wikidata (90%).

Below is the subgraph size comparison of top 50 subgraphs in Wikidata (85%).

Here is an interactive graph showing the comparison of subgraph sizes in terms of item count and triple count: subgraph stats.

Here are some subgraph size visualizations in WDQS:

  • Size as percentage of Wikidata each subgraph occupies: query link
  • Size as percentage of Wikidata items each subgraph contains: query link

Number of days to recovery

Given the current rate of growth, how long would it take Wikidata to return to its original size if some amount of triples were removed from it? This helps us estimate what could be temporarily removed from Wikidata in a situation where the Wikidata backend maxes out. The growth rate of triples is not constant, but if we approximate the growth shown in the Grafana dashboard as a straight line, Wikidata grows at a rate of 4.77M triples per day. This rate was calculated from the number of triples at the start and end of a roughly 90-day interval (11/3/21 to 6/6/21). The actual rate could be somewhat faster or slower, so this gives only a rough approximation of the number of days we can gain by removing some parts of Wikidata.
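
The "Number of days to recover" column in the table of top 50 subgraphs appears consistent with dividing a subgraph's triple count by this daily growth rate. The Python sketch below assumes that formula (it is inferred from the table values rather than stated explicitly); the constant and function names are hypothetical.

```python
# Approximate linear growth rate quoted in the text: 4.77M triples per day.
TRIPLES_PER_DAY = 4_770_000


def days_to_recover(removed_triples, triples_per_day=TRIPLES_PER_DAY):
    """Days until Wikidata regrows by `removed_triples` at a constant growth rate."""
    return removed_triples / triples_per_day


# Triple counts taken from the top-50 table.
scholarly_article_days = round(days_to_recover(6_539_020_889), 2)
astronomical_object_days = round(days_to_recover(1_136_682_291), 2)
```

Under this assumption, removing the scholarly article subgraph would buy roughly 1,371 days, and removing astronomical object roughly 238 days, in line with the table.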

Triples

The triples within a subgraph can be of various types. They can be:

  • truthy triples, such as those using the wdt: prefix
  • non-wikidata direct triples like rdfs:label, schema:name etc
  • full statements that hold other information

See more about these statement types here: RDF_Dump_Format#Statement_types.

Table of top 50 subgraph information

Top 50 Subgraphs in Wikidata
Rank Subgraph Subgraph Name Number of items % of WD items Number of triples % of WD Triples Number of days to recover %of truthy statements %of non-wikidata direct statements %of full statements Number of unique properties
1 Q13442814 scholarly article 37,362,641 39.75 6,539,020,889 49.73 1370.86 12.62 24.28 63.25 722
2 Q6999 astronomical object 8,412,914 8.95 1,136,682,291 8.64 238.3 10.20 14.24 76.07 578
3 Q5 human 9,315,444 9.91 954,536,943 7.26 200.11 13.31 20.06 60.94 4482
4 Q4167836 Wikimedia category 4,840,195 5.15 753,127,982 5.73 157.89 1.19 86.06 5.19 610
5 Q16521 taxon 3,180,248 3.38 367,926,462 2.8 77.13 10.11 37.22 42.97 963
6 Q101352 family name 481,445 0.51 187,299,892 1.42 39.27 1.59 93.49 6.62 375
7 Q4167410 Wikimedia disambiguation page 1,359,804 1.45 180,124,174 1.37 37.76 0.89 88.32 3.83 796
8 Q7187 gene 1,196,361 1.27 122,421,508 0.93 25.66 14.13 12.51 73.11 218
9 Q11266439 Wikimedia template 845,852 0.9 114,308,711 0.87 23.96 0.78 87.06 3.20 222
10 Q11173 chemical compound 1,223,387 1.3 91,228,463 0.69 19.13 12.69 35.72 50.86 591
11 Q8054 protein 986,599 1.05 88,483,828 0.67 18.55 14.26 13.51 72.21 267
12 Q3305213 painting 539,468 0.57 56,769,083 0.43 11.9 12.10 24.46 63.35 785
13 Q13100073 village-level division in China 588,477 0.63 51,615,572 0.39 10.82 6.84 62.73 30.41 77
14 Q11424 film 263,070 0.28 47,176,067 0.36 9.89 14.37 12.74 66.01 1029
15 Q486972 human settlement 563,958 0.6 39,590,792 0.3 8.3 10.84 22.94 49.32 1120
16 Q13406463 Wikimedia list article 334,939 0.36 33,742,245 0.26 7.07 2.45 78.06 10.48 880
17 Q13433827 encyclopedia article 512,141 0.55 33,373,227 0.25 7.0 9.03 45.52 39.76 164
18 Q8502 mountain 525,553 0.56 33,340,188 0.25 6.99 11.37 27.79 50.39 709
19 Q2668072 collection 500,968 0.53 32,670,637 0.25 6.85 15.24 12.30 72.71 665
20 Q79007 street 578,926 0.62 30,252,119 0.23 6.34 13.86 24.24 59.88 572
21 Q4022 river 399,552 0.42 28,833,476 0.22 6.04 11.34 24.99 52.15 580
22 Q30612 clinical trial 356,838 0.38 27,731,502 0.21 5.81 16.05 12.33 71.78 124
23 Q532 village 274,840 0.29 26,483,275 0.2 5.55 9.69 26.25 45.22 818
24 Q17633526 Wikinews article 286,950 0.3 21,830,150 0.17 4.58 3.68 72.76 16.46 256
25 Q482994 album 269,095 0.29 21,181,015 0.16 4.44 12.20 22.64 53.05 638
26 Q23397 lake 260,135 0.28 18,053,096 0.14 3.78 11.32 25.43 53.19 602
27 Q54050 hill 327,277 0.35 17,228,390 0.13 3.61 12.67 23.07 56.77 470
28 Q16970 church building 211,291 0.22 16,821,530 0.13 3.53 14.03 15.06 63.56 1036
29 Q41176 building 265,925 0.28 16,293,008 0.12 3.42 14.21 14.99 68.06 1243
30 Q56436498 village in India 145,824 0.16 15,383,416 0.12 3.23 8.34 22.23 64.50 210
31 Q4830453 business 193,858 0.21 14,101,220 0.11 2.96 13.06 16.23 60.48 1790
32 Q47150325 calendar day of a given year 189,366 0.2 14,078,486 0.11 2.95 6.85 56.25 34.18 56
33 Q3947 house 197,736 0.21 12,468,434 0.1 2.61 15.26 15.18 70.67 760
34 Q3331189 version, edition, or translation 157,486 0.17 10,997,589 0.08 2.31 15.41 14.70 69.39 1006
35 Q18593264 item of collection or exhibition 147,402 0.16 10,732,969 0.08 2.25 16.96 9.69 73.38 306
36 Q27020041 sports season 158,877 0.17 10,693,504 0.08 2.24 12.11 15.12 53.43 353
37 Q355304 watercourse 174,620 0.19 10,080,421 0.08 2.11 12.17 25.07 54.79 302
38 Q7725634 literary work 164,860 0.18 10,049,521 0.08 2.11 13.85 16.69 58.88 1093
39 Q23442 island 148,587 0.16 9,885,277 0.08 2.07 11.40 22.52 50.06 829
40 Q11060274 print 119,806 0.13 9,700,063 0.07 2.03 14.85 9.74 76.96 269
41 Q811979 architectural structure 145,957 0.16 9,666,936 0.07 2.03 10.06 9.91 52.63 994
42 Q5084 hamlet 118,188 0.13 9,013,534 0.07 1.89 10.94 18.63 55.55 423
43 Q9842 primary school 157,451 0.17 8,916,373 0.07 1.87 13.99 16.22 68.98 410
44 Q19389637 biographical article 151,026 0.16 8,238,397 0.06 1.73 12.76 21.07 57.51 131
45 Q21014462 cell line 128,805 0.14 7,955,975 0.06 1.67 10.62 36.18 54.60 60
46 Q47521 stream 124,853 0.13 6,654,366 0.05 1.4 12.95 19.67 58.11 280
47 Q59199015 group of stereoisomers 111,599 0.12 5,843,270 0.04 1.23 15.43 18.82 67.23 216
48 Q61443690 branch post office 129,183 0.14 5,313,033 0.04 1.11 14.59 14.86 70.54 22
49 Q49008 prime number 127,545 0.14 5,188,768 0.04 1.09 10.01 36.78 52.40 101
50 Q4164871 position 120,117 0.13 4,720,668 0.04 0.99 12.72 32.76 52.28 654

Triples per item

While it is interesting to see how big a subgraph is and how many items it has, it is also helpful to know how many triples each item typically has in a given subgraph. A basic idea can be gained from the density of a subgraph, where density = #of triples / #of items, shown in subgraph stats. Below is a box plot showing the distribution of triples per item for the top 50 subgraphs. The box plot omits the min and max values and shows only the mean, median, Q1, and Q3.
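
As a rough illustration, density and the box plot summary values can be computed as follows. This is a sketch using Python's standard library; the function names are hypothetical, and the quartile method (the `statistics.quantiles` default) is an assumption, since the source does not say how Q1 and Q3 were calculated.

```python
from statistics import mean, quantiles


def density(num_triples, num_items):
    """Average number of triples per item in a subgraph."""
    return num_triples / num_items


def box_plot_summary(triples_per_item):
    """Mean, median, Q1 and Q3 of the per-item triple counts (min and max omitted)."""
    q1, median, q3 = quantiles(triples_per_item, n=4)
    return {"mean": mean(triples_per_item), "median": median, "q1": q1, "q3": q3}


# Density of the scholarly article subgraph, using counts from the top-50 table.
scholarly_density = density(6_539_020_889, 37_362_641)

# Toy per-item triple counts (not real data).
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7])
```

With the table values, a scholarly article item has about 175 triples on average.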

Predicates

Predicates are used in all subgraphs. Some predicates are used exclusively in one particular subgraph, while others are used in one particular subgraph 99% of the time. Moreover, the number of unique predicates used in a subgraph indicates how diverse its statements are. These and some further analyses of predicates are presented below.

Number of unique predicates

The number of unique predicates each subgraph uses is listed in the table above (#Table of top 50 subgraph information). Sort by the "Number of unique properties" column to find the most or least diverse subgraphs.

Predicate distribution

There are ~7500 unique predicates across the top 50 subgraphs. Among them, ~3500 (46%) are used in only one subgraph. Below is a figure showing this distribution.

Here is a CSV file with the subgraph-predicate counts: subgraph_pred_df_info.csv.
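
The share of predicates used in only one subgraph can be derived by counting, for each predicate, how many subgraphs use it. A minimal sketch on toy data; the mapping below is illustrative and the function names are hypothetical:

```python
from collections import Counter


def subgraphs_per_predicate(usage):
    """Map each predicate to the number of subgraphs in which it appears."""
    counts = Counter()
    for predicates in usage.values():
        counts.update(predicates)
    return counts


def share_used_once(usage):
    """Fraction of predicates that appear in exactly one subgraph."""
    counts = subgraphs_per_predicate(usage)
    single = sum(1 for c in counts.values() if c == 1)
    return single / len(counts)


# Toy usage map: subgraph -> set of predicates it uses (not the real data).
toy_usage = {
    "human": {"P31", "P21", "P569"},
    "film": {"P31", "P57"},
    "gene": {"P31", "P703"},
}
once = share_used_once(toy_usage)
```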

Top predicates

While it is interesting to see the top predicates for each subgraph, there are too many to show on this page. Below is a table of only the top 3 predicates per subgraph. Here is a CSV file with the top 5 predicates per subgraph: top_subgraph_pred.csv. You can view more from subgraph_pred_df_info.csv with some filtering and grouping.

Note that:

  • The most common top predicates are description, rdf:type, wikibase:rank, and references.
  • Scholarly article descriptions alone make up ~10% of Wikidata triples, with cites work and wikibase:rank in scholarly articles each at ~6%.
  • Wikimedia category descriptions make up 4.5%, and every other entry is ~1% or less of Wikidata triples.

Top predicates per subgraph and their triple distribution
1st top predicate 2nd top predicate 3rd top predicate
Subgraph Predicate #of triples %triples in subgraph %triples in Wikidata Predicate #of triples %triples in subgraph %triples in Wikidata Predicate #of triples %triples in subgraph %triples in Wikidata
Wikimedia category description 596672076 79.226 4.517 22-rdf-syntax-ns#type 22761531 3.022 0.172 rdf-schema#label 15094771 2.004 0.114
Wikimedia disambiguation page description 112264922 62.326 0.85 rdf-schema#label 39587180 21.978 0.3 22-rdf-syntax-ns#type 4146198 2.302 0.031
Wikimedia list article description 23714609 70.282 0.18 22-rdf-syntax-ns#type 1449015 4.294 0.011 instance of 1022424 3.03 0.008
Wikimedia template description 93286385 81.609 0.706 22-rdf-syntax-ns#type 2918907 2.554 0.022 instance of 2557094 2.237 0.019
Wikinews article description 14128436 64.72 0.107 22-rdf-syntax-ns#type 1114921 5.107 0.008 instance of 862083 3.949 0.007
album 22-rdf-syntax-ns#type 2793067 13.187 0.021 ontology#rank 2250457 10.625 0.017 rdf-schema#label 1871641 8.836 0.014
architectural structure ontology#rank 1290969 13.354 0.01 22-rdf-syntax-ns#type 1289149 13.336 0.01 prov#wasDerivedFrom 725555 7.506 0.005
astronomical object ontology#rank 144578828 12.719 1.095 prov#wasDerivedFrom 128331955 11.29 0.972 22-rdf-syntax-ns#type 117137727 10.305 0.887
biographical article 22-rdf-syntax-ns#type 1212529 14.718 0.009 ontology#rank 1061395 12.884 0.008 description 532327 6.462 0.004
branch post office 22-rdf-syntax-ns#type 775436 14.595 0.006 ontology#rank 775436 14.595 0.006 prov#wasDerivedFrom 645628 12.152 0.005
building 22-rdf-syntax-ns#type 2372760 14.563 0.018 ontology#rank 2262925 13.889 0.017 prov#wasDerivedFrom 1130158 6.936 0.009
business 22-rdf-syntax-ns#type 1948087 13.815 0.015 ontology#rank 1669295 11.838 0.013 prov#wasDerivedFrom 919184 6.518 0.007
calendar day of a given year rdf-schema#label 6152052 43.698 0.047 instance of 1146460 8.143 0.009 22-rdf-syntax-ns#type 1040850 7.393 0.008
cell line rdf-schema#label 1060304 13.327 0.008 description 1042089 13.098 0.008 22-rdf-syntax-ns#type 833878 10.481 0.006
chemical compound description 24753956 27.134 0.187 22-rdf-syntax-ns#type 10660602 11.686 0.081 ontology#rank 10484519 11.493 0.079
church building 22-rdf-syntax-ns#type 2541716 15.11 0.019 ontology#rank 2256786 13.416 0.017 prov#wasDerivedFrom 972599 5.782 0.007
clinical trial 22-rdf-syntax-ns#type 4453735 16.06 0.034 ontology#rank 4453642 16.06 0.034 minimum age 1595252 5.752 0.012
collection 22-rdf-syntax-ns#type 5206241 15.936 0.039 ontology#rank 5204557 15.93 0.039 prov#wasDerivedFrom 2150646 6.583 0.016
encyclopedia article description 11968615 35.863 0.091 22-rdf-syntax-ns#type 3302636 9.896 0.025 ontology#rank 2913370 8.73 0.022
family name description 59885567 31.973 0.453 rdf-schema#label 57692163 30.802 0.437 core#altLabel 51377678 27.431 0.389
film 22-rdf-syntax-ns#type 7284888 15.442 0.055 ontology#rank 6480184 13.736 0.049 prov#wasDerivedFrom 3597645 7.626 0.027
gene prov#wasDerivedFrom 16328401 13.338 0.124 22-rdf-syntax-ns#type 16046507 13.108 0.121 ontology#rank 15992810 13.064 0.121
group of stereoisomers 22-rdf-syntax-ns#type 874827 14.972 0.007 ontology#rank 871313 14.911 0.007 found in taxon 612204 10.477 0.005
hamlet 22-rdf-syntax-ns#type 1222418 13.562 0.009 ontology#rank 924362 10.255 0.007 prov#wasDerivedFrom 813808 9.029 0.006
hill 22-rdf-syntax-ns#type 2215050 12.857 0.017 ontology#rank 1857312 10.781 0.014 GeoNames ID 1568224 9.103 0.012
house 22-rdf-syntax-ns#type 1893455 15.186 0.014 ontology#rank 1825282 14.639 0.014 coordinate location 746416 5.986 0.006
human 22-rdf-syntax-ns#type 127210431 13.327 0.963 ontology#rank 115226473 12.071 0.872 rdf-schema#label 82571873 8.65 0.625
human settlement 22-rdf-syntax-ns#type 5214907 13.172 0.039 ontology#rank 3840166 9.7 0.029 description 3222856 8.14 0.024
island 22-rdf-syntax-ns#type 1282293 12.972 0.01 ontology#rank 958245 9.694 0.007 description 897631 9.08 0.007
item of collection or exhibition 22-rdf-syntax-ns#type 1821311 16.969 0.014 ontology#rank 1820436 16.961 0.014 part of 712209 6.636 0.005
lake description 2406074 13.328 0.018 22-rdf-syntax-ns#type 2199894 12.186 0.017 ontology#rank 1813415 10.045 0.014
literary work 22-rdf-syntax-ns#type 1478965 14.717 0.011 ontology#rank 1252618 12.464 0.009 instance of 617093 6.141 0.005
mountain description 4367836 13.101 0.033 22-rdf-syntax-ns#type 4046473 12.137 0.031 ontology#rank 3238261 9.713 0.025
painting description 10311414 18.164 0.078 22-rdf-syntax-ns#type 6848409 12.064 0.052 ontology#rank 6791199 11.963 0.051
position 22-rdf-syntax-ns#type 616685 13.064 0.005 ontology#rank 580022 12.287 0.004 rdf-schema#label 544323 11.531 0.004
primary school 22-rdf-syntax-ns#type 1252277 14.045 0.009 ontology#rank 1236386 13.866 0.009 prov#wasDerivedFrom 1008762 11.314 0.008
prime number description 795712 15.335 0.006 ontology#rank 644583 12.423 0.005 22-rdf-syntax-ns#type 526967 10.156 0.004
print 22-rdf-syntax-ns#type 1425563 14.696 0.011 ontology#rank 1425199 14.693 0.011 prov#wasDerivedFrom 1078309 11.117 0.008
protein prov#wasDerivedFrom 12427165 14.045 0.094 22-rdf-syntax-ns#type 11368057 12.848 0.086 ontology#rank 11357567 12.836 0.086
river description 3651037 12.662 0.028 22-rdf-syntax-ns#type 3567310 12.372 0.027 ontology#rank 2829316 9.813 0.021
scholarly article description 1324177494 20.25 10.025 cites work 853611996 13.054 6.462 ontology#rank 796548851 12.181 6.03
sports season 22-rdf-syntax-ns#type 1572731 14.707 0.012 ontology#rank 1156720 10.817 0.009 instance of 496113 4.639 0.004
stream 22-rdf-syntax-ns#type 873978 13.134 0.007 ontology#rank 734964 11.045 0.006 GeoNames ID 587699 8.832 0.004
street 22-rdf-syntax-ns#type 4236711 14.005 0.032 ontology#rank 4096946 13.543 0.031 description 2256816 7.46 0.017
taxon rdf-schema#label 69840848 18.982 0.529 description 45308808 12.315 0.343 22-rdf-syntax-ns#type 40988244 11.14 0.31
version, edition, or translation 22-rdf-syntax-ns#type 1591714 14.473 0.012 ontology#rank 1538937 13.993 0.012 prov#wasDerivedFrom 712852 6.482 0.005
village 22-rdf-syntax-ns#type 3307961 12.491 0.025 description 3212844 12.132 0.024 ontology#rank 2304145 8.7 0.017
village in India description 2336722 15.19 0.018 ontology#rank 1467450 9.539 0.011 22-rdf-syntax-ns#type 1392224 9.05 0.011
village-level division in China description 24717636 47.888 0.187 rdf-schema#label 4720063 9.145 0.036 22-rdf-syntax-ns#type 3533117 6.845 0.027
watercourse 22-rdf-syntax-ns#type 1256378 12.464 0.01 ontology#rank 1039967 10.317 0.008 description 1009377 10.013 0.008
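
The "%triples in subgraph" values in the table above can be reproduced by dividing a predicate's triple count by the subgraph's total triple count from the top-50 table. The sketch below does this for Wikimedia category; the function name is hypothetical, and only the three predicates listed in the table are included.

```python
def predicate_shares(predicate_counts, subgraph_triples, top_n=3):
    """Return the top_n predicates as (predicate, count, % of subgraph triples)."""
    ranked = sorted(predicate_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (pred, count, round(100 * count / subgraph_triples, 3))
        for pred, count in ranked[:top_n]
    ]


# Counts for Wikimedia category, copied from the tables on this page.
wikimedia_category = {
    "rdf-schema#label": 15094771,
    "description": 596672076,
    "22-rdf-syntax-ns#type": 22761531,
}
rows = predicate_shares(wikimedia_category, subgraph_triples=753127982)
```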

Predicates across subgraphs

From the predicates' point of view: how widely are they used? We already know that some predicates are used in only 1 subgraph. What about the others? The following diagram shows the usage of the 60 most used predicates in Wikidata across the various subgraphs. Note that usage was calculated only for the top 50 subgraphs, which account for ~85% of Wikidata. This should give an idea of where each of these predicates is used most heavily. The remaining subgraphs can be considered a long tail of each of these plots.

The x-axis shows the rank of the subgraph instead of the name to save space. The rank-name mapping can be found in #Table of top 50 subgraph information. The figures are large but can be clicked and zoomed in without loss of resolution for better viewing.

Predicate usage in log scale

Below is the same image in logarithmic scale to show the wide differences in usage counts across subgraphs where applicable.

Top usage

As mentioned above, we can isolate some predicates for which >=99% of their usage falls in one particular subgraph. Some are even used 100% of the time in that subgraph. The following graph shows the distribution of the highest percentage usage that a predicate has in any single subgraph.

A predicate that is used a lot in one particular subgraph may be used only a small number of times in other subgraphs (the second max usage count is low), or it may also be used a lot in other subgraphs (the second max usage count is quite high). In short: we also want to look at the second max percentages.
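
The max and second max percentages for one predicate can be sketched as below. This assumes the percentages are taken over the predicate's total usage within the subgraphs being compared (the source does not state the denominator), and the function name and toy counts are hypothetical.

```python
def max_and_second_max(counts):
    """Return (max %, second max %) of a predicate's usage across subgraphs.

    `counts` maps subgraph name -> number of triples using the predicate.
    """
    total = sum(counts.values())
    shares = sorted((100 * c / total for c in counts.values()), reverse=True)
    second = shares[1] if len(shares) > 1 else 0.0
    return shares[0], second


# Toy counts: a predicate concentrated in one subgraph, and one used in a single subgraph.
concentrated = max_and_second_max({"clinical trial": 990, "human": 10})
exclusive = max_and_second_max({"clinical trial": 5})
```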

Here is an interactive plot showing the max percent, second max percent, color coded with the number of subgraphs the predicate is used in: predicate_usage

Rate of growth of subgraphs

Here is an interactive chart showing the growth of the top 50 subgraphs over a period of one month: subgraph_growth_rate

Subgraph Connectivity

Wikidata is highly interconnected. Although some analysis is possible for individual subgraphs, it is also important to look at how connected these subgraphs are. In fact, Wikidata is one whole graph in and of itself. We defined subgraphs for the purposes of analysis; Wikidata itself (as of now) does not distinguish among these subgraphs in any way.

To define connectivity among subgraphs, we consider not only if they are directly connected, but also if they share some properties and items. These components of connectivity are described below. A visualization was created to show the strength of this connectivity between subgraphs here: wikidata_graph. This also shows further information about the subgraphs themselves and what connects them.

Through items

Items in Wikidata can be instance of (P31) multiple entities. If an item is P31 mountain and also P31 hill, we can assume mountains and hills are closely related. Following this assumption, we find the number of items that each pair of subgraphs has in common. This helps us find similar subgraphs in Wikidata. A 50 X 50 table containing the number of items each pair of subgraphs has in common is available here: subgraph_pair_items.csv. This can also be seen as information on the edges in the subgraph visualization linked above.
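
Counting the items shared by each pair of subgraphs amounts to a set intersection per pair. A minimal sketch on toy data; the item names and function name are hypothetical:

```python
from itertools import combinations


def shared_item_counts(subgraphs):
    """Map each unordered subgraph pair to the number of items they share."""
    return {
        (a, b): len(subgraphs[a] & subgraphs[b])
        for a, b in combinations(sorted(subgraphs), 2)
    }


# Toy subgraphs: item2 is both a hill and a mountain.
toy_subgraphs = {
    "hill": {"item1", "item2"},
    "mountain": {"item2", "item3"},
    "river": {"item4"},
}
pair_counts = shared_item_counts(toy_subgraphs)
```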

Out of the ~80M items that fall under the top 50 subgraphs:

  • 40,232 items are instance of 2 entities (in the top 50)
  • 57 items are instance of 3 entities (in the top 50)
  • And the rest are instance of only 1 entity
  • There are examples of items that are instance of more than 3 entities, but not all of those entities are part of the top 50 subgraphs.

The following is a list of the subgraph pairs that have the most items in common. The full list is given in nonzero_subgraph_pair_detail.csv. While most of the pairs seem reasonable, such as mountain-hill and river-watercourse, some pairs seem odd. The small number of items shared by such pairs supports this. For example, it is odd for the same item to be a village and a river, or a human and a scholarly article, especially since out of the thousands of items these subgraphs have, only 6-10 items are shared. This list can be used as a guide to fixing some of the errors in Wikidata (or to checking whether they are in fact errors).

Subgraph pairs with the largest number of common items
subgraph1 subgraph2 #of items in subgraph1 #of items in subgraph2 #of shared items %of subgraph1 items shared %of subgraph2 items shared
hill mountain 327277 525553 37102 11.34 7.06
river watercourse 399552 174620 14972 3.75 8.57
hamlet village 118188 274840 5282 4.47 1.92
hamlet human settlement 118188 563958 3120 2.64 0.55
human settlement village 563958 274840 2986 0.53 1.09
church building architectural structure 211291 145957 2622 1.24 1.8
river stream 399552 124853 2204 0.55 1.77
version, edition, or translation literary work 157486 164860 1304 0.83 0.79
building architectural structure 265925 145957 946 0.36 0.65
building house 265925 197736 942 0.35 0.48
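
The two percentage columns follow directly from the shared item count and the size of each subgraph. A short sketch, checked against the hill and mountain row; the function name is hypothetical:

```python
def shared_percentages(size_a, size_b, shared):
    """Percentage of each subgraph's items that are shared with the other one."""
    return round(100 * shared / size_a, 2), round(100 * shared / size_b, 2)


# Values from the first and third table rows: hill/mountain and hamlet/village.
hill_mountain = shared_percentages(327277, 525553, 37102)
hamlet_village = shared_percentages(118188, 274840, 5282)
```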

Through properties

Some properties can be used across Wikidata, while others are exclusive to some types of items. The #Predicates across subgraphs section gives an idea of predicate usage. Pairs of subgraphs that use many of the same properties can be considered connected. If a pair of subgraphs has a lot of properties in common, entities from both may show up in queries that use these properties, or at least the query may touch on those entities. A 50 X 50 table containing the number of predicates each pair of subgraphs has in common is available here: subgraph_pair_predicates.csv, and a heatmap for the same is shown below.

Through triples

Another component of subgraph connectivity is direct connection among the triples across subgraphs. For example:

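# Scholarly articles together with their human authors
SELECT ?article ?author
WHERE
{
  ?article wdt:P31 wd:Q13442814.  # ?article is instance of scholarly article
  ?author wdt:P31 wd:Q5.          # ?author is instance of human
  ?article wdt:P50 ?author.       # P50 (author) links the two subgraphs
}
LIMIT 10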

This query contains a triple that connects the scholarly article subgraph with the human subgraph through P50, because ?article is an instance of scholarly article, whereas ?author is an instance of human. The connection here is from the scholarly article subgraph to the human subgraph.

This connection is defined as follows: if the object of a triple that belongs to subgraph A is an instance of subgraph B, there is a connection from subgraph A to subgraph B.
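
This definition can be sketched as a count of directed (A, B) pairs over all triples. The example below uses toy data; the item names, the membership map, and the function name are hypothetical, and an item belonging to several subgraphs contributes one connection per subgraph combination.

```python
from collections import Counter


def count_connections(triples, membership):
    """Count directed subgraph connections.

    `triples` is an iterable of (subject, predicate, object).
    `membership` maps an item to the set of subgraphs it is an instance of.
    A triple adds one connection from each subgraph of its subject
    to each subgraph of its object.
    """
    counts = Counter()
    for subject, _predicate, obj in triples:
        for source in membership.get(subject, ()):
            for target in membership.get(obj, ()):
                counts[(source, target)] += 1
    return counts


# Toy data: two articles with the same author, who has a family name.
toy_triples = [
    ("article1", "P50", "author1"),
    ("article2", "P50", "author1"),
    ("author1", "P734", "name1"),
]
toy_membership = {
    "article1": {"scholarly article"},
    "article2": {"scholarly article"},
    "author1": {"human"},
    "name1": {"family name"},
}
connections = count_connections(toy_triples, toy_membership)
```

Because the edges are directed, the count from scholarly article to human is independent of the count from human to scholarly article.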

Some of the subgraph pairs with the highest number of triple connections are shown below. Note that subgraphs also have many connections to themselves, but these are not shown here. More information about these subgraph connections can be found in the subgraph visualization: wikidata_graph.

Subgraph pairs with most direct triple connections (directed edges)
From subgraph To Subgraph Triple count
scholarly article human 1813767
protein gene 938854
gene protein 937527
human family name 321203
film human 229953
scholarly article chemical compound 173924
Wikimedia category taxon 156468
taxon Wikimedia category 156440
Wikimedia disambiguation page family name 131914
family name Wikimedia disambiguation page 127352

A 50 X 50 chart of all the subgraph pairs and the number of directed connections between them (triples from subgraph A to B are counted separately from triples from subgraph B to A) can be found here: subgraph_pair_triple_connections.csv. A logarithmic scale heatmap of this chart is shown below.