Data Platform/Data Lake/Content/Wikidata item page link
The wmf.wikidata_item_page_link table (available on Hive) contains links between a wikidata item and its related wikipedia pages in various languages. It is generated by joining wmf.wikidata_entity site-links to reconstructed localized-namespaced-page-titles from wmf.mediawiki_page_history and wmf_raw.mediawiki_project_namespace_map.
The dataset is fully regenerated every week, following the wmf.wikidata_entity dataset snapshot release.
WARNING: The wmf.mediawiki_page_history dataset being released every month, there is a divergence between sitelinks available and pages available as the month pass. This means links correctness is better when using snapshots early in the month than later in the month.
Current Schema
$ hive --database wmf hive (wmf)> describe wikidata_item_page_link; OK col_name data_type comment item_id string The wikidata item_id (Q32753077 for instance) wiki_db string The db project of the page the wikidata item links to page_id bigint The id of the page the wikidata item links to page_title string The title of the page the wikidata item links to page_namespace int The namespace of the page the wikidata item links to page_title_localized_namespace string The title with localized namespace header of the page the wikidata item links to snapshot string Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports) # Partition Information # col_name data_type comment snapshot string Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports)
Notice the snapshot field. It is a Hive partition, an explicit mapping to weekly import in HDFS. You must include this partition predicate in the where clause of your queries (even if it is just snapshot > '0'). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns you results. For example, if are only interested in the 2020-01-20 snaphsot, you should add where snapshot = '2020-01-20'. This will instruct Hive to only process data for partitions that match that partition predicate. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.
And entry (like, say, Q1) that has a presence in a number of wikis will have one entry per wiki on this table.
Sample queries
SELECT
wiki_db,
page_id,
COUNT(1) as items_numbers
FROM wmf.wikidata_item_page_link
WHERE snapshot='2020-02-10'
GROUP BY wiki_db, page_id
ORDER BY items_numbers DESC
LIMIT 10;
Changes and known problems since 2020-02
Date from | Task | Details |
---|---|---|
2020-02 | task T244707 | Table is created with first automated snapshots. |