Jump to content

Data Platform/Data Lake/Traffic/Projectview hourly

From Wikitech

The wmf.projectview_hourly table (available on Hive) is 'pre-aggregated' webrequest data, filtered to keep only pageviews, and aggregated over a predefined set of dimension. It is stored in the Parquet columnar file format and partitionned by (year, month, day hour). It is different from the wmf.pageview_hourly dataset in that it involves less dimensions and is therefore smaller in data size (and faster to query). The year, month, day and hour fields follow the UTC time standard.

Current Schema

$ hive --database wmf

hive (wmf)> describe projectview_hourly;
OK
col_name    data_type   comment
project             	string              	Project name from requests hostname
access_method       	string              	Method used to access the pages, can be desktop, mobile web, or mobile app
zero_carrier        	string              	Zero carrier if pageviews are accessed through one, null otherwise
agent_type          	string              	Agent accessing the pages, can be spider, automated, or user
referer_class       	string              	Can be internal, external or unknown
continent           	string              	Continent of the accessing agents (computed using maxmind GeoIP database)
country_code        	string              	Country iso code of the accessing agents (computed using maxmind GeoIP database)
record_version      	string              	Keeps track of changes in the table content definition - https://wikitech.wikimedia.org/wiki/Analytics/Data/Projectview_hourly
view_count            	bigint              	number of pageviews 
year                	int                 	Unpadded year of projectviews
month               	int                 	Unpadded month of projectviews
day                 	int                 	Unpadded day of projectviews
hour                	int                 	Unpadded hour of projectviews 	
 
# Partition Information	 	 
# col_name            	data_type           	comment             	 	 
year                	int                 	Unpadded year of projectviews
month               	int                 	Unpadded month of projectviews
day                 	int                 	Unpadded day of projectviews
hour                	int                 	Unpadded hour of projectviews

Notice the year, month, day, and hour fields. These are Hive partitions, and are explicit mappings to hourly aggregations in HDFS. You must include at least one partition predicate in the where clause of your queries (even if it is just year > 0). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns you results. For example, if are only interested in data during a particular day, you could add where year = 2014 and month = 1 and day = 12. This will instruct Hive to only process data for partitions that match that partition predicate. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.

Changes and known problems since 2015-06-22

See also m:Research:Page view#Change log for changes to the page view definition itself
Date from Task record_version Details
2015-04-01[1] 0.0.1 Create table with projectviews aggregated from pre-aggregated pageviews source and backfill aggregation from beginning of April.
April data doesn't have zero_carrier, referer_class not continent set due to data definition changes.

See also

The code that generates it:

Notes

  1. Action taken on the 22nd of June, but since data is available from beginning of April, the date of this line is set so