Data Platform/Data Lake/Edits/Mediawiki page history
This page describes the data set that stores the page history of WMF's wikis. It lives in the Analytics Hadoop cluster and is accessible via the Hive/Beeline external table wmf.mediawiki_page_history. For more detail on the purpose of this data set, please read Analytics/Data Lake/Page and user history reconstruction. Also see Analytics/Data access if you don't know how to access this data set.
Schema
col_name | data_type | comment |
---|---|---|
wiki_db | string | enwiki, dewiki, eswiktionary, etc. |
page_id | bigint | Id of the page, as in the page table. |
page_artificial_id | string | Generated Id for deleted pages without real Id. |
page_creation_timestamp | string | Creation timestamp of the page. |
page_first_edit_timestamp | string | Timestamp of the page's first revision (can be before page_creation in restore/merge cases). |
page_title_historical | string | Historical page title, with spaces replaced by underscores. |
page_title | string | Page title as of today, with spaces replaced by underscores. |
page_namespace_historical | int | Historical namespace. |
page_namespace_is_content_historical | boolean | Whether the historical namespace is categorized as content |
page_namespace | int | Namespace as of today. |
page_namespace_is_content | boolean | Whether the current namespace is categorized as content |
page_is_redirect | boolean | In revision/page events: whether the page is currently a redirect |
page_is_deleted | boolean | Whether the page is rebuilt from a delete event |
start_timestamp | string | Timestamp from where this state applies (inclusive). |
end_timestamp | string | Timestamp to where this state applies (exclusive). |
caused_by_event_type | string | Event that caused this state (create, move, delete or restore). |
caused_by_user_id | bigint | ID of the user that caused this state. |
caused_by_user_text | string | Name of the user that caused this state |
caused_by_anonymous_user | boolean | Whether the user that caused this state was anonymous |
inferred_from | string | If non-NULL, some fields have been inferred from an inconsistency in the source data. |
source_log_id | bigint | ID of the logging table row that caused this state |
source_log_comment | string | Comment of the logging table row that caused this state |
source_log_params | map<string,string> | Parameters of the logging table row that caused this state, parsed as a map |
snapshot | string | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports) |
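
The start_timestamp (inclusive) and end_timestamp (exclusive) fields define a half-open interval for each page state. The following is a minimal Python sketch of how a point-in-time lookup over such states could work. The sample rows, the `state_at` helper, the timestamp format and the treatment of a NULL end_timestamp as "still current" are all assumptions made for illustration, not confirmed properties of the data set.

```python
from typing import Optional

# Hypothetical rows for a single page, using column names from the schema.
# Assumption: timestamps are strings that sort chronologically as text.
states = [
    {
        "page_title_historical": "Old_title",
        "start_timestamp": "2015-01-01 00:00:00",
        "end_timestamp": "2016-03-10 12:00:00",
        "caused_by_event_type": "create",
    },
    {
        "page_title_historical": "New_title",
        "start_timestamp": "2016-03-10 12:00:00",
        "end_timestamp": None,  # assumption: NULL means the state is still current
        "caused_by_event_type": "move",
    },
]


def state_at(rows: list, ts: str) -> Optional[dict]:
    """Return the row whose [start_timestamp, end_timestamp) interval contains ts."""
    for row in rows:
        start, end = row["start_timestamp"], row["end_timestamp"]
        starts_on_or_before = start is None or start <= ts  # start is inclusive
        ends_after = end is None or ts < end                # end is exclusive
        if starts_on_or_before and ends_after:
            return row
    return None


before_move = state_at(states, "2016-03-10 11:59:59")
at_move = state_at(states, "2016-03-10 12:00:00")
print(before_move["page_title_historical"], at_move["page_title_historical"])
```

Because the end is exclusive, a timestamp equal to the boundary belongs to the later state, so consecutive states never overlap.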
Note on the snapshot field: it is a Hive partition that maps explicitly to snapshot folders in HDFS. Since every snapshot contains the full data up to the snapshot date, you should always specify a snapshot partition predicate in the WHERE clause of your queries.
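
As a rough illustration of the snapshot predicate, the Python sketch below builds a HiveQL query string that always filters on a single snapshot. The helper name `build_page_history_query`, the validation rules and the example values (2019-07, simplewiki) are assumptions for illustration. Only the table and column names come from the schema above, and the sketch does not connect to Hive.

```python
import re

TABLE = "wmf.mediawiki_page_history"


def build_page_history_query(snapshot: str, wiki_db: str, limit: int = 10) -> str:
    """Build a HiveQL query string restricted to one snapshot partition."""
    # The schema describes snapshots as YYYY-MM for regular imports.
    if not re.fullmatch(r"\d{4}-\d{2}", snapshot):
        raise ValueError(f"snapshot must look like YYYY-MM, got {snapshot!r}")
    # Assumption: wiki_db values are lowercase identifiers such as enwiki.
    if not re.fullmatch(r"[a-z0-9_]+", wiki_db):
        raise ValueError(f"unexpected wiki_db value: {wiki_db!r}")
    return (
        "SELECT page_id, page_title_historical, start_timestamp,\n"
        "       end_timestamp, caused_by_event_type\n"
        f"FROM {TABLE}\n"
        f"WHERE snapshot = '{snapshot}'\n"  # partition predicate: reads one snapshot only
        f"  AND wiki_db = '{wiki_db}'\n"
        f"LIMIT {int(limit)}"
    )


query = build_page_history_query("2019-07", "simplewiki")
print(query)
```

Without the snapshot predicate, a query would scan every snapshot folder and return each page state once per snapshot that contains it.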
FAQ
Changes and known problems
Snapshot or date | Details | Phab Task |
---|---|---|
2019-07 | Schema changes: addition of caused_by_anonymous_user and page_first_edit_timestamp. | task T221825 |
2019-04 | Schema changes (no breaking change, only new fields): addition of page_is_deleted, caused_by_user_text, source_log_id, source_log_comment, source_log_params. Change in how delete/restore are handled: restore was assumed to always create a new page_id, but it actually does not. It either restores a deleted page if no page exists with the given title, or does nothing if a page already exists with that title (restore-into: revisions from a previously deleted page with the given title are merged into the existing page). | task T221824 |
2017-11 | For pairs of fields that give current and historical versions of a value, the fields were renamed so that _historical is appended to the historical field rather than _latest to the current one. | |
2016/10/06 | The dataset contains data for simplewiki and enwiki until September 2016. We still need to productionize the automatic updates to that table and import all the wikis. | |
2017/03/01 | Added the snapshot partition, which allows keeping multiple versions of the page history. Data starts to flow regularly (every month) from labs. | |