Data Platform/Data Lake/Traffic/Browser general
wmf.browser_general is an intermediate table that makes it easier and faster to get pageview statistics broken down by user-agent related dimensions like OS family, OS major, browser family, browser major, etc. In addition to ad-hoc queries, it is also used to generate the Browser Reports.
Current Schema
$ hive --database wmf hive (wmf)> describe browser_general; OK col_name data_type comment access_method string (desktop|mobile web|mobile app) os_family string OS family: Windows, Android, etc. os_major string OS major version: 8, 10, etc. browser_family string Browser family: Chrome, Safari, etc. browser_major string Browser major version: 47, 11, etc. view_count bigint Number of pageviews. year int Unpadded year of request. month int Unpadded month of request. day int Unpadded day of request. # Partition Information # col_name data_type comment year int Unpadded year of request. month int Unpadded month of request. day int Unpadded day of request.
Notice the year, month and day fields. These are Hive partitions, and are explicit mappings to daily files in HDFS. You must include at least one partition predicate in the where clause of your queries (even if it is just year > 0
). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns your results. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.
Loading of anonymized data
wmf.browser_general is loaded through this oozie job. It is a basic query that extracts data from wmf.pageview_hourly
and wmf.projectview_hourly
, with a threshold-based anonymization applied to it. For privacy and readability reasons, any rows that initially represent less than K% of the total view count are grouped into a single row, where all columns (os_family, os_major, browser_family, browser_major) are assigned to the "Other" constant (see job config for exact K and constant values).
Example query
SELECT browser_family, browser_major, SUM(view_count) as view_count FROM wmf.browser_general WHERE access_method IN ('desktop', 'mobile web') AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '2016-01-01' AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) < '2016-02-01' GROUP BY browser_family, browser_major ORDER BY view_count DESC LIMIT 1000;
Changes and known problems since 2016-03-21
Date from | Date until | Task | Details |
---|---|---|---|
beginning | now | task T148461 | Inconsistent data in #all-sites-by-os-and-browser for IE7 |
... | ... | ... | ... |