Mediawiki wikitext history
wmf.mediawiki_wikitext_history
is a dataset available in the Data Lake that provides the full content of all revisions, past and present, from Wikimedia wikis (except Wikidata).
The content is stored as unparsed Wikitext. Each monthly snapshot should arrive between the 10th and 12th of the following month.
Wikidata is excluded to reduce the total latency of the dataset from about 23 days to about 11. This shouldn't be a problem, since using Wikidata's XML dumps is strongly discouraged anyway.
Schema
Note: The snapshot and wiki_db fields are Hive partitions, and they map directly to snapshot folders in HDFS. Since every snapshot contains the full history up to its snapshot date, you should always pick a single snapshot in the where clause of your query (see the example query after the schema table).
| col_name | data_type | comment |
|---|---|---|
| page_id | bigint | id of the page |
| page_namespace | int | namespace of the page |
| page_title | string | title of the page |
| page_redirect_title | string | title of the redirected-to page |
| page_restrictions | array<string> | restrictions of the page |
| user_id | bigint | id of the user that made the revision (or -1 if anonymous) |
| user_text | string | text of the user that made the revision (either username or IP) |
| revision_id | bigint | id of the revision |
| revision_parent_id | bigint | id of the parent revision |
| revision_timestamp | string | timestamp of the revision (ISO 8601 format) |
| revision_minor_edit | boolean | whether this revision is a minor edit or not |
| revision_comment | string | comment made with the revision |
| revision_text_bytes | bigint | number of bytes of the revision text |
| revision_text_sha1 | string | sha1 hash of the revision text |
| revision_text | string | text of the revision |
| revision_content_model | string | content model of the revision |
| revision_content_format | string | content format of the revision |
| snapshot | string | versioning information to keep multiple datasets (YYYY-MM for regular imports) |
| wiki_db | string | the wiki database (project) the revision belongs to |
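
To make the partition advice concrete, here is a minimal PySpark sketch of such a query, assuming a Spark session with access to the wmf Hive database (for example on an analytics client); the snapshot and wiki_db values are illustrative only:

```python
from pyspark.sql import SparkSession

# Assumes a Spark environment with access to the wmf Hive database.
spark = (
    SparkSession.builder
    .appName("wikitext-history-example")
    .enableHiveSupport()
    .getOrCreate()
)

revisions = spark.sql("""
    SELECT page_id, page_title, revision_id, revision_timestamp, revision_text
    FROM wmf.mediawiki_wikitext_history
    WHERE snapshot = '2024-02'   -- always pin a single snapshot
      AND wiki_db = 'simplewiki' -- filtering on wiki_db also prunes partitions
    LIMIT 10
""")
revisions.show(truncate=80)
```

Filtering on both partition columns keeps the query from scanning every snapshot and wiki, which matters given the size of the full-history text.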
Changes and known problems
| Date | Phab Task | Snapshot version | Details |
|---|---|---|---|
| 2024-03-01 | task T357859 | 2024-02 | Wikidata is now excluded in order to dramatically speed up the pipeline. |
| 2019-11-01 | task T236687 | 2019-10 | Change underlying file format from parquet to avro to prevent memory issues at read time. |
| 2018-09-01 | task T202490 | 2018-09 | Creation of the table. Data starts to flow regularly (every month). |
Pipeline
- The source data is the pages-meta-history public XML data dumps. The bottleneck is the English Wikipedia dump, which finishes between the 7th and the 9th (Wikidata generally takes until the 19th, which is why it is split into a separate job, wikidata_wikitext_history).
- A Puppet-managed SystemD timer runs a Python script that imports the XML dump files into HDFS, in folders following the pattern hdfs:///wmf/data/raw/mediawiki/dumps/pages_meta_history/YYYYMMDD/WIKI_DB. Wikidata is excluded from this step.
- An Airflow job refines the XML dumps into Avro data, stored in folders following the pattern hdfs:///wmf/data/wmf/mediawiki/wikitext/history/snapshot=YYYY-MM/wiki_db=WIKI_DB.
Note that there is a one-month difference between the snapshot value of the Avro-converted data and the date of the raw XML data. This is because the Data Lake convention is that the date tells which data is available (for instance, 2019-11 means that data for 2019-11 is present), while for the dumps, the date tells when the dump process started (for instance, 20191201 means the dump started on 2019-12-01, so it contains data for 2019-11 but not 2019-12).
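
As a concrete illustration of this convention, here is a small helper (hypothetical, for illustration only) that maps a raw dump folder date (YYYYMMDD) to the snapshot it feeds (YYYY-MM), i.e. the previous month:

```python
def dump_date_to_snapshot(dump_date: str) -> str:
    """Map a raw dump folder date (YYYYMMDD) to the snapshot it feeds (YYYY-MM).

    A dump started at the beginning of a month covers data up to the end of
    the previous month, so the snapshot is one month behind the dump date.
    """
    year, month = int(dump_date[:4]), int(dump_date[4:6])
    if month == 1:
        year, month = year - 1, 12
    else:
        month -= 1
    return f"{year:04d}-{month:02d}"

# The 2019-12-01 dump run produces the 2019-11 snapshot.
assert dump_date_to_snapshot("20191201") == "2019-11"
```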
The data is stored in Avro format rather than Parquet to prevent memory errors due to vectorized columnar reading in Parquet.
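
For reference, a sketch of reading one wiki's partition directly from the Avro files instead of going through the Hive table; this assumes the spark-avro package is available in the Spark environment, and the snapshot and wiki_db in the path are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wikitext-avro-read").getOrCreate()

# Path follows the pattern above; substitute a real snapshot and wiki_db.
path = ("hdfs:///wmf/data/wmf/mediawiki/wikitext/history/"
        "snapshot=2024-02/wiki_db=simplewiki")

# The partition columns (snapshot, wiki_db) come from the directory names,
# so they are not stored inside the Avro records themselves.
df = spark.read.format("avro").load(path)
df.select("page_title", "revision_id", "revision_text_bytes").show(5)
```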