Data Platform/Data Lake/Traffic/Pageview actor
The wmf.pageview_actor table (available on Hive) contains filtered webrequest data to keep only pageviews and redirects to pageviews. It keeps most dimensions from webrequest, has an updated agent_type
value flagging traffic estimated automated
, and offers the actor_signature
field facilitating in-project session-fingerprinting. It is stored in the Parquet columnar file format and partitioned by (year, month, day, hour). As webrequest, the data is deleted after 90 days.
This intermediary dataset is meant to be used as a replacement for webrequest when queries filter for pageviews. The reason it should be used is because it is about ten times smaller than webrequest for the same time-period, and therefore is a lot faster to query. For instance, production jobs generating pageview_hourly, unique-devices or clickstream take advantage of this table.
Note: This table doesn't aggregate rows as pageview_hourly does, it only filters.
Current Schema
$ hive --database wmf hive (wmf)> describe pageview_actor; OK col_name data_type comment hostname string Source node hostname sequence bigint Per host sequence number dt string Timestame at cache in ISO 8601 time_firstbyte double Time to first byte ip string IP of packet at cache http_status string HTTP status of response response_size bigint Response size uri_host string Host of request uri_path string Path of request uri_query string Query of request content_type string Content-Type header of response referer string Referer header of request user_agent string User-Agent header of request accept_language string Accept-Language header of request is_pageview boolean Indicates if this record was marked as a pageview during refinement is_redirect_to_pageview boolean Indicates if this record was marked as a redirect to a pageview during extraction (needed for unique-devices-per-project-family) geocoded_data map<string,string> Geocoded map with continent, country_code, country, city, subdivision, postal_code, latitude, longitude, timezone keys and associated values. user_agent_map map<string,string> User-agent map with browser_family, browser_major, device_family, os_family, os_major, os_minor and wmf_app_version keys and associated values x_analytics_map map<string,string> X_analytics map view of the x_analytics field ts timestamp Unix timestamp in milliseconds extracted from dt access_method string Method used to access the site (mobile app|mobile web|desktop) agent_type string Categorise the agent making the webrequest as either user, spider or automated referer_class string Indicates if a referer is internal, external or unknown. normalized_host struct<project_class:string,project:string,qualifiers:array<string>,tld:string,project_family:string> struct containing project_family (such as wikipedia or wikidata for instance), project (such as en or commons), qualifiers (a list of in-between values, such as m) and tld (org most often) pageview_info map<string,string> map containing project, language_variant and page_title values only when is_pageview = TRUE. page_id bigint MediaWiki page_id for this page title. For redirects this could be the page_id of the redirect or the page_id of the target. This may not always be set, even if the page is actually a pageview. namespace_id int MediaWiki namespace_id for this page title. This may not always be set, even if the page is actually a pageview. actor_signature string The actor signature for the record using domain, computed as a hash actor_signature_per_project_family string The actor signature for the record using project-family, computed as a hash year int Unpadded year of pageviews month int Unpadded month of pageviews day int Unpadded day of pageviews hour int Unpadded hour of pageviews # Partition Information # col_name data_type comment year int Unpadded year of pageviews month int Unpadded month of pageviews day int Unpadded day of pageviews hour int Unpadded hour of pageviews
Notice the year, month, day, and hour fields. These are Hive partitions, and are explicit mappings to hourly folders in HDFS. You must include at least one partition predicate in the where clause of your queries (even if it is just year > 0). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns you results. For example, if are only interested in data during a particular day, you could add where year = 2014 and month = 1 and day = 12. This will instruct Hive to only process data for partitions that match that partition predicate. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.
Sample queries
SELECT
concat(month,'-',day,'-',year), agent_type, count(1)
FROM
wmf.pageview_actor
WHERE
year = 2020
AND month = 6
AND day = 25
AND is_pageview
GROUP BY
year, month, day, agent_type;
Changes and known problems since 2020-06-25
- See also Analytics/Data_Lake/Traffic/Webrequest#Changes and known problems since 2015-03-04 for issues and updates affecting all webrequests (including non-pageviews)
Date from | Task | Details |
---|---|---|
2020-06-01[1] | task T225467 | Create the table and start filtering data. |
See also
- The code that generates it:
Notes
- ↑ Action taken on the 25th of June, but data has been backfilled from the beginning of the month