Data Platform/Data Lake/Edits/Mediawiki project namespace map

The wmf_raw.mediawiki_project_namespace_map table (available on Hive) contains project and namespace data for every project referenced in the Wikimedia sitematrix. It is generated by querying the sitematrix API for the list of projects, and then, for each project, querying that project's API for its site-info data.

The dataset is fully regenerated on the first of every month.
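Each monthly run adds a new snapshot partition. To see which snapshots are currently available, you can list the table's partitions from a Hive session (a minimal sketch; it assumes you have access to a Hive client on an analytics client host):

hive (wmf_raw)> show partitions mediawiki_project_namespace_map;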

TODO: Merge this dataset into the canonical_data Hive database.

Current Schema

$ hive --database wmf_raw

hive (wmf_raw)> describe mediawiki_project_namespace_map;
OK
col_name	data_type	comment
hostname            	string              	Canonical URL for the project, for example ja.wikipedia.org
dbname              	string              	Database name for the project, for example jawiki
namespace           	int                 	for example 0, 100, etc.
namespace_canonical_name	string              	the english prefix if exists, otherwise the localized prefix
namespace_localized_name	string              	the localized prefix
namespace_is_content	int                 	Whether this namespace is a content namespace
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment              
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)

Notice the snapshot field. It is a Hive partition, an explicit mapping to the monthly imports in HDFS. You must include this partition predicate in the where clause of your queries (even if it is just snapshot > '0'). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns results. For example, if you are only interested in the 2020-01 snapshot, you should add where snapshot = '2020-01'. This instructs Hive to process only the partitions that match that predicate. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.
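For example, the following query lists the content namespaces of Japanese Wikipedia for a given snapshot (a sketch only; the snapshot value '2020-01' is an assumption, substitute one returned by show partitions):

hive (wmf_raw)> select namespace,
                       namespace_canonical_name,
                       namespace_localized_name
                from   mediawiki_project_namespace_map
                where  snapshot = '2020-01'
                  and  hostname = 'ja.wikipedia.org'
                  and  namespace_is_content = 1;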

Changes and known problems since 2019-08

Date from	Task	Details
2017-??  	??  	Table is created with first automated snapshots (doc created on 2020-02)

See also