
Data Platform/Data Lake/Traffic/Unique Devices


How is this data computed

We compute this data using the Last-Access cookie. For details see Analytics/Data Lake/Traffic/Unique Devices/Last access solution and m:Research:Unique Devices.

Tables schema

As of 2017-07, there are 4 'unique devices' tables available in the wmf database on Hive:

  • unique_devices_per_domain_daily stores unique devices counts per domain (e.g. en.m.wikipedia.org) split by country per day
  • unique_devices_per_domain_monthly stores unique devices counts per domain split by country per month
  • unique_devices_per_project_family_daily stores unique devices counts per project (e.g. Wikipedia) split by country per day
  • unique_devices_per_project_family_monthly stores unique devices counts per project split by country per month

unique_devices_per_domain_daily / unique_devices_per_domain_monthly

Column | Type | Description
domain | string | Lowercased domain accessed (for instance, en.wikipedia.org)
country | string | Country name of the accessing agents (computed using the MaxMind GeoIP database)
country_code | string | 2-letter country code
uniques_underestimate | int | Underestimate of unique devices, based on the Last-Access cookie and the nocookies header. Counts unique devices that came to a given host at least twice.
uniques_offset | int | Unique devices offset, computed as the number of 1-action sessions without cookies
uniques_estimate | int | Estimate of total unique devices seen, computed as uniques_underestimate plus uniques_offset
year | int | Unpadded year of requests
month | int | Unpadded month of requests
day | int | Unpadded day of requests (only for the unique_devices_..._daily tables)

unique_devices_per_project_family_daily / unique_devices_per_project_family_monthly

Column | Type | Description
project_family | string | Lowercased project family accessed (for instance, wikipedia or wikivoyage)
country | string | Country name of the accessing agents (computed using the MaxMind GeoIP database)
country_code | string | 2-letter country code
uniques_underestimate | int | Underestimate of unique devices, based on the Last-Access global cookie and the nocookies header. Counts unique devices that came to a given project family at least twice.
uniques_offset | int | Unique devices offset, computed as the number of 1-action sessions without cookies
uniques_estimate | int | Estimate of total unique devices seen, computed as uniques_underestimate plus uniques_offset
year | int | Unpadded year of requests
month | int | Unpadded month of requests
day | int | Unpadded day of requests (only for the unique_devices_per_project_family_daily table)
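
The relationship between the three count columns can be checked with a short script. The following is only a sketch: it uses Python's sqlite3 module with an in-memory database attached under the name wmf as a stand-in for Hive, and the two rows are invented for illustration, not real measurements.

```python
import sqlite3

# Stand-in for Hive: an in-memory SQLite database attached as "wmf" (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS wmf")
conn.execute("""
    CREATE TABLE wmf.unique_devices_per_domain_daily (
        domain TEXT, country TEXT, country_code TEXT,
        uniques_underestimate INTEGER, uniques_offset INTEGER,
        uniques_estimate INTEGER,
        year INTEGER, month INTEGER, day INTEGER
    )
""")

# Invented rows, not real measurements.
conn.executemany(
    "INSERT INTO wmf.unique_devices_per_domain_daily VALUES (?,?,?,?,?,?,?,?,?)",
    [
        ("es.wikipedia.org", "Spain", "ES", 800, 200, 1000, 2015, 12, 24),
        ("es.wikipedia.org", "Mexico", "MX", 500, 100, 600, 2015, 12, 24),
    ],
)

# Count rows where the estimate is not underestimate + offset.
mismatches = conn.execute("""
    SELECT COUNT(*)
    FROM wmf.unique_devices_per_domain_daily
    WHERE uniques_estimate <> uniques_underestimate + uniques_offset
""").fetchone()[0]
print(mismatches)
```

The SELECT statement itself uses only the columns documented above, so the same check should carry over to the Hive tables.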

Data quirks

There are some minor domains in the per-domain datasets that do not match a wiki listed in canonical_data.wikis. These are generally redirect domains (e.g. mai.wiktionary.org, which redirects to the corresponding Incubator project, or za.wikimedia.org, which used to redirect to za.wikipedia.org) or non-wiki domains that use our infrastructure (e.g. noc.wikimedia.org).
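
One way to list such domains is an anti-join against canonical_data.wikis. The sketch below makes several assumptions: it runs on in-memory SQLite databases standing in for Hive, the rows are invented, the column name domain_name in canonical_data.wikis is assumed rather than taken from this page, and it matches domains exactly (real domain values may need normalizing before they can be compared).

```python
import sqlite3

# Stand-ins for the Hive databases (assumption): two in-memory SQLite databases.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS wmf")
conn.execute("ATTACH DATABASE ':memory:' AS canonical_data")
conn.execute(
    "CREATE TABLE wmf.unique_devices_per_domain_daily "
    "(domain TEXT, uniques_estimate INTEGER)"
)
# "domain_name" is an assumed column name for this illustration.
conn.execute("CREATE TABLE canonical_data.wikis (domain_name TEXT)")

# Invented rows, not real measurements.
conn.executemany(
    "INSERT INTO wmf.unique_devices_per_domain_daily VALUES (?, ?)",
    [("es.wikipedia.org", 1000), ("noc.wikimedia.org", 5), ("mai.wiktionary.org", 3)],
)
conn.execute("INSERT INTO canonical_data.wikis VALUES ('es.wikipedia.org')")

# Anti-join: keep only domains with no matching row in canonical_data.wikis.
unmatched = [row[0] for row in conn.execute("""
    SELECT DISTINCT u.domain
    FROM wmf.unique_devices_per_domain_daily u
    LEFT JOIN canonical_data.wikis w ON u.domain = w.domain_name
    WHERE w.domain_name IS NULL
    ORDER BY u.domain
""")]
print(unmatched)
```

Reversing the condition to w.domain_name IS NOT NULL would instead restrict a query to domains that do match a listed wiki.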

Sample query to get total uniques for a given host or project_family for a day

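-- Total unique devices for one domain on one day, summed over all countries
SELECT
  SUM(uniques_estimate) AS total_uniques
FROM wmf.unique_devices_per_domain_daily
WHERE year = 2015 AND month = 12 AND day = 24
  AND domain = 'es.wikipedia.org';

-- Total unique devices for one project family on one day, summed over all countries
SELECT
  SUM(uniques_estimate) AS total_uniques
FROM wmf.unique_devices_per_project_family_daily
WHERE year = 2017 AND month = 4 AND day = 1
  AND project_family = 'wikipedia';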
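
The monthly tables can be queried the same way, without the day column. As a hedged illustration, the sketch below breaks one domain's monthly estimate down by country and keeps the two largest. It runs against an in-memory SQLite database attached as wmf in place of Hive, with a reduced set of columns and invented rows.

```python
import sqlite3

# Stand-in for Hive: an in-memory SQLite database attached as "wmf" (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS wmf")
conn.execute("""
    CREATE TABLE wmf.unique_devices_per_domain_monthly (
        domain TEXT, country_code TEXT, uniques_estimate INTEGER,
        year INTEGER, month INTEGER
    )
""")

# Invented rows, not real measurements.
conn.executemany(
    "INSERT INTO wmf.unique_devices_per_domain_monthly VALUES (?,?,?,?,?)",
    [
        ("es.wikipedia.org", "ES", 9000, 2017, 4),
        ("es.wikipedia.org", "MX", 7000, 2017, 4),
        ("es.wikipedia.org", "AR", 3000, 2017, 4),
        ("es.wikipedia.org", "ES", 8000, 2017, 3),
        ("en.wikipedia.org", "US", 50000, 2017, 4),
    ],
)

# Per-country estimate for one domain and month, largest first.
top_countries = conn.execute("""
    SELECT country_code, SUM(uniques_estimate) AS uniques
    FROM wmf.unique_devices_per_domain_monthly
    WHERE year = 2017 AND month = 4
      AND domain = 'es.wikipedia.org'
    GROUP BY country_code
    ORDER BY uniques DESC
    LIMIT 2
""").fetchall()
print(top_countries)
```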

Data Quality

The Last-Access based uniques metric has proven to be highly variable for small projects.

Please read Analytics/Data_Lake/Traffic/Unique_Devices/Last_access_solution#Data_Quality_Analysis.

Changes and Known Problems with Dataset

  • 2016-02-19: Monthly per-domain data is available as of January 2016.

Date from | Date until | Task | Details
2025-04-01 | - | task T386177 | Webrequest source changes to HAProxy
2021-02-09 | 2022-06-30 | task T316572 | Unique devices by family metrics were overcounted by approximately 5% globally. For more details, read Analytics/Data Lake/Data Issues/2021-02-09 Unique Devices By Family Overcount
2020-06-24 (daily) / 2020-06-01 (monthly) | - | task T250744 | Quality improvement through removal of automated traffic. See Analytics/Data Lake/Traffic/Unique Devices/Automated traffic correction
2018-05-30 | 2018-06-03 | task T199517 | Unique devices increase of 170% for wikidata in June
start | 2017-05-18 | task T165661 | Per-domain unique-devices computation excluded countries that didn't have either underestimates or offset until 2017-05-18.
start | 2017-06-11 | task T167005 | Per-domain unique-devices computation was under-counting fresh sessions (offset) by about 10% until 2017-06-11.
2016-11-04 | 2017-02-14 | task T165560 | Artificial spike in offset of unique devices from November to February on wikidata, likely related to the Varnish 4 rollout
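
When analysing a time series, it can help to flag dates that fall inside one of the ranges listed above. The sketch below is a hypothetical helper, not an existing tool: the name tasks_affecting is invented, both range ends are treated as inclusive (an assumption), and it ignores which tables or projects each entry applies to, so every match still needs to be checked against the Details text.

```python
from datetime import date

# (date_from, date_until, task) copied from the entries listed above.
# None marks an open end ("start" or no end date).
# T250744 uses the daily start date; the monthly data changed from 2020-06-01.
KNOWN_CHANGES = [
    (date(2025, 4, 1), None, "T386177"),
    (date(2021, 2, 9), date(2022, 6, 30), "T316572"),
    (date(2020, 6, 24), None, "T250744"),
    (date(2018, 5, 30), date(2018, 6, 3), "T199517"),
    (None, date(2017, 5, 18), "T165661"),
    (None, date(2017, 6, 11), "T167005"),
    (date(2016, 11, 4), date(2017, 2, 14), "T165560"),
]


def tasks_affecting(day):
    """Return the tasks whose date range contains `day` (both ends inclusive)."""
    return [
        task
        for start, end, task in KNOWN_CHANGES
        if (start is None or start <= day) and (end is None or day <= end)
    ]


print(tasks_affecting(date(2017, 1, 1)))
```

For 2017-01-01 this returns T165661, T167005 and T165560; for a date such as 2019-01-01, which falls in none of the ranges, it returns an empty list.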

See also