Data Platform/Data Lake/Edits/MediaWiki history dumps
This page describes the dump of the denormalized revision, user and page history of all WMF wikis. It is computed from the MediaWiki History dataset, lives in the Analytics Hadoop cluster, and is downloadable from MediaWiki Dumps. A new snapshot containing the full history is produced at the beginning of each month.
General Information
Content
This data set contains a historical record of revision (without text), user and page events of Wikimedia wikis since 2001. The data is denormalized, meaning that all events for user, page and revision are stored in the same schema. As a result, some fields are always null for some events (for instance, fields about pages are null in events about users). Events about users and pages have been processed to rebuild a history that is as coherent as possible in terms of user renames and page moves (see Page and user history reconstruction). Some data has also been precomputed to facilitate analyses, such as edit counts per user and per page, reverting and reverted revisions, and more.
Updates
This data set is updated monthly, around the end of the month's first week. Each update contains a full dump from 2001 (the beginning of MediaWiki-time) up to the most recent complete month. The reason for this particularity is the underlying data, the MediaWiki databases. Every time a user gets renamed, a revision reverted, a page moved, etc., the existing related records in the logging table are updated accordingly. So an event triggered today may change records in that table that date back 10 years. The logging table is the base of the MediaWiki history reconstruction process, so incremental downloads of these dumps may produce inconsistent data. Consider using [[1]] for real-time updates on MediaWiki changes (API docs).
Versioning
Each update is named after the last month it covers, in YYYY-MM format. For example, if the dump spans from 2001 to August 2019 (included), it is named 2019-08 even though it is released in the first days of September 2019. There is a folder for each available month at the root of the download URL. For storage reasons, only the last two versions are available. This shouldn't be a problem, as every version contains the whole historical dataset.
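The naming rule above can be sketched in a few lines of Python. This is only an illustration: the function name `snapshot_version` is hypothetical, and it assumes that a snapshot released in a given month always covers data up to the end of the previous month.

```python
from datetime import date


def snapshot_version(release_day: date) -> str:
    """Return the YYYY-MM name of the snapshot released on `release_day`.

    Assumption: the snapshot covers data up to the end of the month
    preceding the release date.
    """
    year, month = release_day.year, release_day.month
    if month == 1:
        # A January release is named after December of the previous year.
        year, month = year - 1, 12
    else:
        month -= 1
    return f"{year:04d}-{month:02d}"


# Example from the text: a release in early September 2019.
print(snapshot_version(date(2019, 9, 3)))  # 2019-08
```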
Partitioning
The data is organized by wiki and time range. This way it can be downloaded for a single wiki (or set of wikis). The time split is necessary to keep file sizes manageable. There are 3 different time range splits: monthly, yearly and all-time. Very big wikis are partitioned monthly, medium wikis are partitioned yearly, and small wikis are dumped in one single file. This ensures that files are not larger than ~2GB while avoiding a very large number of files.
- Wikis partitioned monthly: wikidatawiki, commonswiki, enwiki.
- Wikis partitioned yearly: dewiki, frwiki, eswiki, itwiki, ruwiki, jawiki, viwiki, zhwiki, ptwiki, enwiktionary, plwiki, nlwiki, svwiki, metawiki, arwiki, shwiki, cebwiki, mgwiktionary, fawiki, frwiktionary, ukwiki, hewiki, kowiki, srwiki, trwiki, loginwiki, huwiki, cawiki, nowiki, mediawikiwiki, fiwiki, cswiki, idwiki, rowiki, enwikisource, frwikisource, ruwiktionary, dawiki, bgwiki, incubatorwiki, enwikinews, specieswiki, thwiki.
- Wikis in one single file: all the others.
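As a rough illustration of the partitioning rule, the sketch below maps a wiki and a date to the time range label its file would carry. The wiki lists are copied from the text above; the names `MONTHLY_WIKIS`, `YEARLY_WIKIS` and `time_range_label` are hypothetical, and the lists may change between snapshots.

```python
# Wiki lists copied from the partitioning description above.
MONTHLY_WIKIS = frozenset("wikidatawiki commonswiki enwiki".split())
YEARLY_WIKIS = frozenset("""
    dewiki frwiki eswiki itwiki ruwiki jawiki viwiki zhwiki ptwiki enwiktionary
    plwiki nlwiki svwiki metawiki arwiki shwiki cebwiki mgwiktionary fawiki
    frwiktionary ukwiki hewiki kowiki srwiki trwiki loginwiki huwiki cawiki
    nowiki mediawikiwiki fiwiki cswiki idwiki rowiki enwikisource frwikisource
    ruwiktionary dawiki bgwiki incubatorwiki enwikinews specieswiki thwiki
""".split())


def time_range_label(wiki: str, year: int, month: int) -> str:
    """Return the time range label of the file holding events of that month."""
    if wiki in MONTHLY_WIKIS:
        return f"{year:04d}-{month:02d}"  # one file per month
    if wiki in YEARLY_WIKIS:
        return f"{year:04d}"              # one file per year
    return "all-time"                     # one single file


print(time_range_label("wikidatawiki", 2019, 5))  # 2019-05
print(time_range_label("ptwiki", 2018, 3))        # 2018
print(time_range_label("cawikinews", 2010, 1))    # all-time
```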
File format
The file format is tab-separated-value (TSV) instead of JSON in order to reduce the file sizes (JSON repeats field names for every record). Most fields of the schema are atomic (integer, string, boolean...), and a few are arrays of strings.
Some details:
- Undefined or null values are represented as empty fields, again to make the data lighter.
- String arrays are encoded as value1,value2,...,valueN with commas escaped in values.
- In text fields, carriage returns, line feeds and tabs are escaped with a \ to keep the TSV format valid.
The files are compressed with Bzip2, because it is widely used, is free software, and has a high compression rate. Note that with Bzip2, you can concatenate several compressed files and treat them as a single Bzip2 file.
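The following Python sketch shows one possible way to read these files. It relies on the standard `bz2` module, which reads concatenated Bzip2 streams as a single file. The helper names (`parse_line`, `split_array`, `read_records`) are hypothetical. The array decoder assumes that commas inside values are escaped with a backslash, which the description above does not state explicitly, so verify it against real data before relying on it.

```python
import bz2
import os
import tempfile


def parse_line(line):
    """Split one TSV record into fields; empty fields become None (null)."""
    # Only strip the line terminator: trailing tabs are empty (null) fields.
    return [f if f != "" else None for f in line.rstrip("\n").split("\t")]


def split_array(field):
    """Decode a string-array field into a list of strings.

    Assumption: a comma inside a value is written as backslash + comma.
    Other backslashes are left untouched.
    """
    if field is None:
        return []
    values, current, i = [], [], 0
    while i < len(field):
        if field[i] == "\\" and field[i + 1:i + 2] == ",":
            current.append(",")  # escaped comma belongs to the value
            i += 2
        elif field[i] == ",":
            values.append("".join(current))  # unescaped comma ends the value
            current = []
            i += 1
        else:
            current.append(field[i])
            i += 1
    values.append("".join(current))
    return values


def read_records(path):
    """Yield parsed records from a (possibly concatenated) .tsv.bz2 file."""
    with bz2.open(path, mode="rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            yield parse_line(line)


# Demonstration with two independently compressed parts, concatenated.
part_a = bz2.compress("enwiki\trevision\tcreate\n".encode("utf-8"))
part_b = bz2.compress("enwiki\tuser\t\n".encode("utf-8"))
with tempfile.NamedTemporaryFile(suffix=".tsv.bz2", delete=False) as tmp:
    tmp.write(part_a + part_b)
records = list(read_records(tmp.name))
os.remove(tmp.name)
print(records)
```

Note that the line is split only after removing the trailing newline, never with a plain `strip()`, since that would also drop trailing empty fields.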
Directory structure
When choosing a file (or set of files) to download, the URL should look like this:
/<version>/<wiki>/<version>.<wiki>.<time_range>.tsv.bz2
Where
- <version> is the YYYY-MM formatted snapshot, e.g. 2019-08;
- <wiki> is the wiki database name, e.g. enwiki or commonswiki;
- <time_range> is either YYYY-MM for big wikis, YYYY for medium wikis, or all-time for the rest (see Partitioning above).
Examples of dump files:
/2019-12/wikidatawiki/2019-12.wikidatawiki.2019-05.tsv.bz2
/2019-12/ptwiki/2019-12.ptwiki.2018.tsv.bz2
/2019-12/cawikinews/2019-12.cawikinews.all-time.tsv.bz2
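The path pattern above can be assembled programmatically. The sketch below is a minimal illustration; `dump_path` is a hypothetical helper, and the returned path is relative to the download root, which is not specified here.

```python
import re

# Accepted labels: YYYY-MM (monthly), YYYY (yearly) or all-time.
_TIME_RANGE = re.compile(r"\d{4}-\d{2}|\d{4}|all-time")
_VERSION = re.compile(r"\d{4}-\d{2}")


def dump_path(version: str, wiki: str, time_range: str) -> str:
    """Build the path of a dump file, relative to the download root."""
    if not _VERSION.fullmatch(version):
        raise ValueError(f"version must be YYYY-MM, got {version!r}")
    if not _TIME_RANGE.fullmatch(time_range):
        raise ValueError(f"unexpected time range: {time_range!r}")
    return f"/{version}/{wiki}/{version}.{wiki}.{time_range}.tsv.bz2"


print(dump_path("2019-12", "wikidatawiki", "2019-05"))
print(dump_path("2019-12", "ptwiki", "2018"))
print(dump_path("2019-12", "cawikinews", "all-time"))
```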
Technical Documentation
Note: In the documentation below, "current" refers to the time of the snapshot, and "historical" to the time of the event. A subpage lists answers to some frequently asked questions: Analytics/Data Lake/Edits/Mediawiki history dumps/FAQ.
Access
The easiest way to play with dumps is to use PAWS. See these example notebooks.
You can access the dumps through Toolforge. If you have a Cloud VPS instance, you can add the mount_nfs role to get the /public/dumps/public mount. You also need to enable the mount server-side; see this patch for an example. See Portal:Data_Services/Admin/Runbooks/Enable_NFS_for_a_project for full details.
Schema overview
The dataset contains many fields (70 to be precise), but they follow a structure that helps make sense of them. The fields can be divided into 5 classes:
- event global fields -- They are used on every event of the dataset (wiki_db, event_entity, event_type, event_timestamp, event_comment).
- event user fields -- They provide information on the user having performed the event. They are set for all events in the dataset except when denormalizing user data has failed.
- page fields -- They provide information about the page the event applies to. They are set for page events (event_entity = 'page') and revision events (event_entity = 'revision').
- user fields -- They provide information about the user the event applies to. They are set for user events only (event_entity = 'user').
- revision fields -- They provide information about the revision the event applies to. They are set for revision events only (event_entity = 'revision').
Note: Except for the event global class, whose fields do not share a consistent prefix, all field names are prefixed with their field class.
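The prefix convention makes it possible to derive the class of a field from its name. The sketch below illustrates this; `field_class` is a hypothetical helper, and the event global field names are taken from the list above.

```python
# The five event global fields have no common prefix, so list them explicitly.
EVENT_GLOBAL_FIELDS = frozenset(
    ["wiki_db", "event_entity", "event_type", "event_timestamp", "event_comment"]
)

# Order matters: "event_user_" must be tested before "user_".
PREFIX_TO_CLASS = (
    ("event_user_", "event user"),
    ("page_", "page"),
    ("user_", "user"),
    ("revision_", "revision"),
)


def field_class(field_name: str) -> str:
    """Return the field class a schema field belongs to."""
    if field_name in EVENT_GLOBAL_FIELDS:
        return "event global"
    for prefix, name in PREFIX_TO_CLASS:
        if field_name.startswith(prefix):
            return name
    raise ValueError(f"unknown field: {field_name!r}")


print(field_class("event_timestamp"))        # event global
print(field_class("event_user_groups"))      # event user
print(field_class("page_title_historical"))  # page
```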
Important fields: event_entity and event_type
Because user, page and revision events are stored in the same dataset (it is denormalized), filtering by event_entity, and possibly also by event_type, is necessary to avoid mixing incompatible data.
Entity | Event type | Meaning |
---|---|---|
revision | create | Editing a page |
page | create | Creating a page |
page | create-page | Page creation according to the logging table [note 1 below] |
page | delete | Deleting a page |
page | move | Changing a page's title |
page | restore | Undeleting a page |
page | merge | Merging revisions from another page [note 2 below] |
user | create | Registering of a new account |
user | rename | Changing the name of a user |
user | altergroups | Changing the groups (rights) of a user |
user | alterblocks | Blocking/unblocking a user |
- note 1: Establishing exactly when a page was created is not simple. The logging table has a record for page creation, and we expose this in our datasets as a "create-page" event. However, the first revision for some pages is *before* this logging table entry. Therefore, we made a decision to use that event as the "create". You can follow along with our logic at PageHistoryBuilder #L778 and at PageEventBuilder #L180.
- note 2: we don't process merges much, and the documentation is sparse: https://www.mediawiki.org/wiki/Manual:Log_actions
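A minimal sketch of such filtering over already parsed records is shown below. It assumes that fields appear in the order of the schema, so that event_entity is the second field and event_type the third; the names `ENTITY_COL`, `TYPE_COL` and `select_events` are hypothetical.

```python
# Assumed 0-based positions, following the order of the schema.
ENTITY_COL = 1  # event_entity
TYPE_COL = 2    # event_type


def select_events(records, entity, event_type=None):
    """Yield the records matching an entity and, optionally, an event type."""
    for record in records:
        if record[ENTITY_COL] != entity:
            continue
        if event_type is not None and record[TYPE_COL] != event_type:
            continue
        yield record


# Toy records, truncated to the first three fields for readability.
sample = [
    ["enwiki", "revision", "create"],
    ["enwiki", "page", "move"],
    ["enwiki", "page", "create"],
    ["enwiki", "user", "rename"],
]
print(list(select_events(sample, "page")))
print(list(select_events(sample, "page", "move")))
```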
Schema details
Field class | Field name | Data type | Comment |
---|---|---|---|
Event global | wiki_db | string | enwiki, dewiki, eswiktionary, etc. |
Event global | event_entity | string | revision, user or page |
Event global | event_type | string | create, move, delete, etc. Detailed explanation in the docs under #Event_types |
Event global | event_timestamp | string | When this event occurred |
Event global | event_comment | string | Comment related to this event, sourced from log_comment, rev_comment, etc. |
Event user | event_user_id | bigint | ID of the user that caused the event. Null if the user is anonymous or if from a revision where the user has been revision deleted. |
Event user | event_user_text_historical | string | Historical username (IP address for anonymous user) of the user that caused the event. Null for revisions where the user has been revision deleted. |
Event user | event_user_text | string | Current username of the user that caused the event. Null for anonymous users (the IP is stored in event_user_text_historical). Null for revisions where the user has been revision deleted. |
Event user | event_user_blocks_historical | array<string> | Historical blocks of the user that caused the event |
Event user | event_user_blocks | array<string> | Current blocks of the user that caused the event |
Event user | event_user_groups_historical | array<string> | Historical groups of the user that caused the event |
Event user | event_user_groups | array<string> | Current groups of the user that caused the event |
Event user | event_user_is_bot_by_historical | array<string> | Historical bot information of the user that caused the event, can contain values name or group |
Event user | event_user_is_bot_by | array<string> | Bot information of the user that caused the event, can contain values name or group |
Event user | event_user_is_created_by_self | boolean | Whether the event_user created their own account |
Event user | event_user_is_created_by_system | boolean | Whether the event_user account was created by MediaWiki (e.g. centralauth) |
Event user | event_user_is_created_by_peer | boolean | Whether the event_user account was created by another user |
Event user | event_user_is_anonymous | boolean | Whether the event_user is not registered. True for revisions where the user has been revision deleted, even if the user was actually registered. |
Event user | event_user_registration_timestamp | string | Registration timestamp of the user that caused the event (from user table) |
Event user | event_user_creation_timestamp | string | Creation timestamp of the user that caused the event (from logging table) |
Event user | event_user_first_edit_timestamp | string | Timestamp of the first edit of the user that caused the event |
Event user | event_user_revision_count | bigint | Number of revisions made by the event_user up to the historical time in this wiki_db (only available in revision-create events so far). For revision-create events, this includes the event itself. |
Event user | event_user_seconds_since_previous_revision | bigint | In revision events: seconds elapsed since the previous revision made by the current event_user_id (only available in revision-create events so far) |
Page | page_id | bigint | In revision/page events: id of the page |
Page | page_title_historical | string | In revision/page events: historical title of the page |
Page | page_title | string | In revision/page events: current title of the page |
Page | page_namespace_historical | int | In revision/page events: historical namespace of the page |
Page | page_namespace_is_content_historical | boolean | In revision/page events: whether the historical namespace of the page is categorized as content |
Page | page_namespace | int | In revision/page events: current namespace of the page |
Page | page_namespace_is_content | boolean | In revision/page events: whether the current namespace of the page is categorized as content |
Page | page_is_redirect | boolean | In revision/page events: whether the page is currently a redirect |
Page | page_is_deleted | boolean | In revision/page events: whether the page is rebuilt from a delete event |
Page | page_creation_timestamp | string | In revision/page events: creation timestamp of the page |
Page | page_first_edit_timestamp | string | In revision/page events: timestamp of the page's first revision. Can be before the page_creation in some restore/merge cases (see revision_is_from_before_page_creation). |
Page | page_revision_count | bigint | In revision/page events: cumulative revision count per page for the current page_id (only available in revision-create events so far) |
Page | page_seconds_since_previous_revision | bigint | In revision/page events: seconds elapsed since the previous revision made on the current page_id (only available in revision-create events so far) |
User | user_id | bigint | In user events: id of the user |
User | user_text_historical | string | In user events: historical username or IP address of the user |
User | user_text | string | In user events: current username or IP address of the user |
User | user_blocks_historical | array<string> | In user events: historical user blocks |
User | user_blocks | array<string> | In user events: current user blocks |
User | user_groups_historical | array<string> | In user events: historical user groups |
User | user_groups | array<string> | In user events: current user groups |
User | user_is_bot_by_historical | array<string> | In user events: historical bot information of the user, can contain values name or group |
User | user_is_bot_by | array<string> | In user events: bot information of the user, can contain values name or group |
User | user_is_created_by_self | boolean | In user events: whether the user created their own account |
User | user_is_created_by_system | boolean | In user events: whether the user account was created by MediaWiki |
User | user_is_created_by_peer | boolean | In user events: whether the user account was created by another user |
User | user_is_anonymous | boolean | In user events: whether the user is not registered |
User | user_registration_timestamp | string | In user events: registration timestamp of the user |
User | user_creation_timestamp | string | In user events: creation timestamp of the user (from logging table) |
User | user_first_edit_timestamp | string | In user events: timestamp of the first edit of the user |
Revision | revision_id | bigint | In revision events: id of the revision |
Revision | revision_parent_id | bigint | In revision events: id of the parent revision |
Revision | revision_minor_edit | boolean | In revision events: whether it is a minor edit or not |
Revision | revision_deleted_parts | array<string> | In revision events: deleted parts of the revision, can contain values text, comment and user |
Revision | revision_deleted_parts_are_suppressed | boolean | In revision events: whether the deleted parts are hidden from admins as well (visible only by stewards) |
Revision | revision_text_bytes | bigint | In revision events: number of bytes of revision |
Revision | revision_text_bytes_diff | bigint | In revision events: change in bytes relative to parent revision (can be negative) |
Revision | revision_text_sha1 | string | In revision events: sha1 hash of the revision |
Revision | revision_content_model | string | In revision events: content model of revision |
Revision | revision_content_format | string | In revision events: content format of revision |
Revision | revision_is_deleted_by_page_deletion | boolean | In revision events: whether this revision has been deleted (moved to archive table) |
Revision | revision_deleted_by_page_deletion_timestamp | string | In revision events: the timestamp when the revision was deleted |
Revision | revision_is_identity_reverted | boolean | In revision events: whether this revision was reverted by another future revision |
Revision | revision_first_identity_reverting_revision_id | bigint | In revision events: id of the revision that reverted this revision |
Revision | revision_seconds_to_identity_revert | bigint | In revision events: seconds elapsed between revision posting and its revert (if there was one) |
Revision | revision_is_identity_revert | boolean | In revision events: whether this revision reverts other revisions |
Revision | revision_is_from_before_page_creation | boolean | In revision events: true if the revision timestamp is before the page creation (can happen with restore events) |
Revision | revision_tags | array<string> | In revision events: tags associated to the revision |
Code examples
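As a starting point, the Python sketch below counts events per (event_entity, event_type) pair while streaming through the lines of a dump file. It is only an illustration under stated assumptions: fields are assumed to appear in the order of the schema above (event_entity second, event_type third), files are assumed to have no header row, and the function names and the file name in the usage comment are hypothetical.

```python
import bz2
from collections import Counter

# Assumed 0-based positions, following the order of the schema.
ENTITY_COL = 1  # event_entity
TYPE_COL = 2    # event_type


def count_events(lines):
    """Count events per (event_entity, event_type) over an iterable of TSV lines."""
    counts = Counter()
    for line in lines:
        # Strip only the line terminator so trailing empty fields survive.
        fields = line.rstrip("\n").split("\t")
        counts[(fields[ENTITY_COL], fields[TYPE_COL])] += 1
    return counts


def count_events_in_file(path):
    """Same as count_events, reading a .tsv.bz2 file line by line."""
    with bz2.open(path, mode="rt", encoding="utf-8", newline="\n") as handle:
        return count_events(handle)


# Hypothetical usage on a downloaded file:
#   counts = count_events_in_file("2019-12.cawikinews.all-time.tsv.bz2")
#   for (entity, event_type), n in counts.most_common():
#       print(entity, event_type, n)
```

Reading line by line keeps memory usage low, which matters for files close to the ~2GB size limit.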
Changes and known problems
Date | PhabTask | Snapshot version | Details |
---|---|---|---|
2020-01 | | 2019-12 | Initial release |