Data Platform/Data Lake/Edits/Mediawiki project namespace map
The wmf_raw.mediawiki_project_namespace_map table (available in Hive) contains project and namespace data for every project referenced in the Wikimedia sitematrix. It is generated by querying the sitematrix API for the list of projects, and then querying each project's API for its siteinfo (namespace) data.
The dataset is fully regenerated on the first of every month.
TODO: Merge this dataset into the canonical_data Hive database.
Current Schema
$ hive --database wmf_raw
hive (wmf_raw)> describe mediawiki_project_namespace_map;
OK
col_name                    data_type  comment
hostname                    string     Canonical URL for the project, for example ja.wikipedia.org
dbname                      string     Database name for the project, for example jawiki
namespace                   int        Namespace number, for example 0, 100, etc.
namespace_canonical_name    string     The English prefix if it exists, otherwise the localized prefix
namespace_localized_name    string     The localized prefix
namespace_is_content        int        Whether this namespace is a content namespace
snapshot                    string     Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)

# Partition Information
# col_name                  data_type  comment
snapshot                    string     Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
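To see which snapshots are currently available (assuming you have access to the wmf_raw database on the cluster), you can list the table's partitions:

SHOW PARTITIONS wmf_raw.mediawiki_project_namespace_map;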
Notice the snapshot field. It is a Hive partition, an explicit mapping to the monthly imports in HDFS. You must include a predicate on this partition in the where clause of your queries (even if it is just snapshot > '0'). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns results. For example, if you are only interested in the 2020-01 snapshot, you should add where snapshot = '2020-01'. This instructs Hive to process only the partitions that match that predicate. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.
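As a minimal sketch (the snapshot value and the selected columns are illustrative, not prescriptive), such a query could look like:

SELECT dbname,
       namespace,
       namespace_canonical_name
FROM wmf_raw.mediawiki_project_namespace_map
WHERE snapshot = '2020-01'         -- partition predicate, required
  AND namespace_is_content = 1     -- keep only content namespaces
LIMIT 10;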
Changes and known problems since 2019-08
Date from | Task | Details |
---|---|---|
2017-?? | ?? | Table is created with first automated snapshots (doc created on 2020-02) |
See also
- The code that generates the dataset: