Data Platform/Data Lake/Traffic
Traffic refers to pageviews to the pages of a wiki project. This page links to detailed information about traffic datasets in the Data Lake.
Most of the datasets below are updated at hourly granularity: a new hour of data becomes available every hour, with a delay of 2 to 3 hours (the hour has to finish, and then the data has to be computed).
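As a rough illustration of the delay described above, the following sketch estimates the most recent hour that is likely to be queryable. The function name `latest_available_hour` and the default `delay_hours=3` are assumptions made for this example (the conservative end of the 2 to 3 hour range, measured from the start of the hour); it is not part of any Data Lake tooling.

```python
from datetime import datetime, timedelta, timezone


def latest_available_hour(now: datetime, delay_hours: int = 3) -> datetime:
    """Return the start of the most recent hour expected to be available.

    Assumption: an hour of data starting at H is ready by H + delay_hours.
    """
    # Truncate to the top of the current hour, then step back by the delay.
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    return top_of_hour - timedelta(hours=delay_hours)


if __name__ == "__main__":
    print(latest_available_hour(datetime.now(timezone.utc)))
```

With this assumption, a query issued at 12:30 UTC would target the hour starting at 09:00 UTC. The hour starting at 10:00 may already be present, but that is not guaranteed.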
Datasets
Hive tables
These datasets are available as Hive tables and can be queried using one of the available SQL engines, or accessed directly through HDFS.
Dataset Name | Description |
---|---|
webrequest hive table (see also a separate list of Hive tables derived from webrequest) | The webrequest stream contains data on all the hits to Wikimedia's servers. This includes requests for page HTML, images, CSS, and JavaScript, as well as requests to the API. |
pageview_actor hive table | The wmf.pageview_actor table is a smaller version of the webrequest table, with fewer columns. |
pageview_hourly hive table | The wmf.pageview_hourly table contains 'pre-aggregated' webrequest data, filtered to keep only pageviews, and aggregated over a predefined set of dimensions. |
projectview_hourly hive table | The wmf.projectview_hourly table is 'pre-aggregated' webrequest data at the project level. It differs from the wmf.pageview_hourly dataset in that it involves fewer dimensions and is therefore smaller in data size (and faster to query). |
unique devices | This dataset counts how many distinct devices visit our projects |
browser general | This dataset gives you pageview statistics broken down by user-agent related dimensions like OS family, OS major, browser family, browser major |
mediawiki_api_request | The mediawiki_api_request table provides the log of API requests to MediaWiki |
mobile apps session metrics | Contains aggregate stats about pageview sessions on the Android and iOS Wikipedia mobile apps |
mobile apps uniques | Counts how many distinct installs of the Android and iOS Wikipedia mobile apps accessed Wikimedia sites during a given day or month |
inter language | Traffic between different languages on the same project family |
virtualpageview_hourly | Provides data about page previews on desktop Wikipedia |
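Because most of these tables are updated hourly, queries usually restrict the time range to specific hours. The sketch below builds a SQL string for the most-viewed pages of one project in one hour of wmf.pageview_hourly. The partition columns (`year`, `month`, `day`, `hour`), the columns `project`, `page_title`, `agent_type` and `view_count`, and both helper function names are assumptions made for illustration; check the table's own documentation for the actual schema before running anything.

```python
import re
from datetime import datetime


def hourly_partition_filter(hour: datetime) -> str:
    """Build a WHERE fragment selecting a single hourly partition.

    Assumption: the table is partitioned by year, month, day and hour.
    """
    return (
        f"year = {hour.year} AND month = {hour.month} "
        f"AND day = {hour.day} AND hour = {hour.hour}"
    )


def build_top_pages_query(hour: datetime, project: str, limit: int = 10) -> str:
    """Return a SQL string listing the most-viewed pages for one hour."""
    # Only allow simple project names such as 'en.wikipedia', since the
    # value is interpolated into the SQL text.
    if not re.fullmatch(r"[a-z0-9_.\-]+", project):
        raise ValueError(f"unexpected project name: {project!r}")
    return (
        "SELECT page_title, SUM(view_count) AS views\n"
        "FROM wmf.pageview_hourly\n"
        f"WHERE {hourly_partition_filter(hour)}\n"
        f"  AND project = '{project}'\n"
        "  AND agent_type = 'user'\n"
        "GROUP BY page_title\n"
        "ORDER BY views DESC\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    print(build_top_pages_query(datetime(2024, 1, 1, 9), "en.wikipedia"))
```

The resulting string can be pasted into whichever SQL engine you use. Filtering on the partition columns first should keep the engine from scanning the whole table.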
Dumps
These datasets are made available as files, updated at regular intervals.
- Pageviews and Projectviews dumps [To be updated]
- Compressed pageviews dumps [To be updated]
- mediacounts
- Wikipedia clickstream
Deprecated or Obsolete Datasets
The following datasets are no longer in use, but the pages are kept to document history:
Access
All data in the Data Lake is private by default. To learn how to get access, see Data_Platform/Data access. Some of the data above is also public in other systems (see the Analytics main page).
History
Some partial information about the evolution of publishing analytics data at WMF is recorded here in a timeline.