Data Platform/Data Lake/Traffic/Projectview hourly
The wmf.projectview_hourly table (available on Hive) is 'pre-aggregated' webrequest data, filtered to keep only pageviews, and aggregated over a predefined set of dimensions. It is stored in the Parquet columnar file format and partitioned by (year, month, day, hour). It differs from the wmf.pageview_hourly dataset in that it has fewer dimensions and is therefore smaller in data size (and faster to query). The year, month, day and hour fields follow the UTC time standard.
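As a sketch of typical usage (column names are taken from the schema below; the project value and date are illustrative):

```sql
-- Hourly desktop vs. mobile-web pageviews for one project on one day,
-- restricted to human traffic via agent_type.
SELECT
  hour,
  access_method,
  SUM(view_count) AS views
FROM wmf.projectview_hourly
WHERE year = 2015
  AND month = 6
  AND day = 22
  AND agent_type = 'user'
  AND project = 'en.wikipedia'
GROUP BY hour, access_method
ORDER BY hour;
```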
Current Schema
$ hive --database wmf
hive (wmf)> describe projectview_hourly;
OK
col_name        data_type  comment
project         string     Project name from requests hostname
access_method   string     Method used to access the pages, can be desktop, mobile web, or mobile app
zero_carrier    string     Zero carrier if pageviews are accessed through one, null otherwise
agent_type      string     Agent accessing the pages, can be spider, automated, or user
referer_class   string     Can be internal, external or unknown
continent       string     Continent of the accessing agents (computed using maxmind GeoIP database)
country_code    string     Country iso code of the accessing agents (computed using maxmind GeoIP database)
record_version  string     Keeps track of changes in the table content definition - https://wikitech.wikimedia.org/wiki/Analytics/Data/Projectview_hourly
view_count      bigint     number of pageviews
year            int        Unpadded year of projectviews
month           int        Unpadded month of projectviews
day             int        Unpadded day of projectviews
hour            int        Unpadded hour of projectviews

# Partition Information
# col_name      data_type  comment
year            int        Unpadded year of projectviews
month           int        Unpadded month of projectviews
day             int        Unpadded day of projectviews
hour            int        Unpadded hour of projectviews
Notice the year, month, day, and hour fields. These are Hive partitions, and are explicit mappings to hourly aggregations in HDFS. You must include at least one partition predicate in the where clause of your queries (even if it is just year > 0). Partitions allow you to reduce the amount of data that Hive must parse and process before returning results. For example, if you are only interested in data for a particular day, you could add where year = 2014 and month = 1 and day = 12. This instructs Hive to only process data for partitions that match that partition predicate. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.
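Putting that day restriction into a full query might look like this (a sketch; the LIMIT and ordering are illustrative):

```sql
-- Count pageviews per project for a single day. The year/month/day
-- predicates let Hive prune all non-matching hourly partitions.
SELECT
  project,
  SUM(view_count) AS views
FROM wmf.projectview_hourly
WHERE year = 2014
  AND month = 1
  AND day = 12
GROUP BY project
ORDER BY views DESC
LIMIT 10;
```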
Usually, partitions are used to allow working with smaller subsets of data than the full table. In our case the data is very small, and we have chosen to partition the table hourly nonetheless: despite Hadoop's small-files problem, having hourly success flags for Oozie is worth the price.
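Concretely, each hourly partition corresponds to an HDFS directory following the standard Hive key=value layout; the base path below is illustrative, not the actual location:

```
<base path>/year=2015/month=6/day=22/hour=0/
```

This is why the partition predicates above (year = …, month = …) translate directly into reading fewer directories.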
Changes and known problems since 2015-06-22
- See also m:Research:Page view#Change log for changes to the page view definition itself
Date from | Task | record_version | Details
---|---|---|---
2015-04-01[1] | | 0.0.1 | Create table with projectviews aggregated from the pre-aggregated pageviews source, and backfill aggregation from the beginning of April. April data does not have zero_carrier, referer_class, nor continent set, due to data definition changes.
See also
The code that generates it:
- https://github.com/wikimedia/analytics-refinery/blob/master/hql/projectview/hourly/create_projectview_hourly_table.hql
- https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/analytics/dags/projectview/projectview_hourly_dag.py
- Data in this table goes back to April 2015 only. For historical traffic comparisons with earlier times, your best bet might be staging.pageviews05 (aka Cube v0.5, a private database table available via stat1006 that contains sampled pageview data according to an earlier implementation of the same page view definition, covering a timespan from 2013 until April 2015). See Meta-wiki for a comparison listing discrepancies between these two tables.
Notes
- ↑ Action taken on the 22nd of June, but since data is available from the beginning of April, the date of this line is set accordingly.