Metrics Platform/Sampling

From Wikitech

This page provides details about:

  1. The properties that may be used as sampling units; and
  2. The properties of an experiment enrolment sampling algorithm

Sampling Units

Pageview ID

Every pageview is an independent event with a unique ID. A new pageview ID is generated when the user:

  • Navigates to a page;
  • Refreshes the page;
  • Opens the page again in the same window or tab; or
  • Opens the page again in a different window or tab
FIXME: The pageview ID should be regenerated when navigating away from and then back to the page quickly.

Actions which occur within the scope of a single pageview can be correlated by pageview ID; actions which span multiple pageviews cannot. Actions that share a pageview ID must have taken place on a single device, and it is safe to assume they were performed by a single user. However, because a new ID is generated for every pageview, a pageview ID cannot be used as a proxy for an individual user.

Session ID

A browsing session consists of one or more pageviews on one domain. A new session ID is generated when:

  • The user first navigates to a page;
  • The user opens the page again in a private browsing window or tab; or
  • The session expires

Actions which occur within the scope of a session – i.e. within the scope of multiple pageviews – can be correlated by session ID. These actions must have taken place on a single device, and it is safe to assume they were performed by a single user. A session ID can be used as a proxy for an individual user.

Session Expiry

Sessions can expire on the Wikipedias and in the iOS and Android apps. When a session expires, a new session ID is generated. The mechanism for session expiry on the Wikipedias differs from that in the apps:

  • On MediaWiki, a session expires if the user has not clicked, typed, or scrolled in the foreground window or tab for at least 30 minutes;
  • In the apps, a session expires if the user has not used the app for at least 30 minutes
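
The inactivity rule above can be sketched as follows. This is a minimal illustration, not the Metrics Platform implementation: SessionState, touchSession, and the generateId callback are hypothetical names, and the sketch assumes the caller invokes touchSession on each qualifying activity (a click, keypress, or scroll on MediaWiki; any use of the app in the apps).

```typescript
// Hypothetical sketch; none of these names come from the Metrics Platform API.
const SESSION_TIMEOUT_MS = 30 * 60 * 1000; // 30 minutes of inactivity

interface SessionState {
  id: string;
  lastActivityMs: number; // timestamp of the most recent qualifying activity
}

// Returns the session that applies to an activity at time `nowMs`.
// A new ID is generated if there is no session yet, or if at least
// 30 minutes have passed since the last recorded activity.
function touchSession(
  state: SessionState | null,
  nowMs: number,
  generateId: () => string
): SessionState {
  if (state === null || nowMs - state.lastActivityMs >= SESSION_TIMEOUT_MS) {
    return { id: generateId(), lastActivityMs: nowMs };
  }
  // Still active: keep the ID and record the new activity time.
  return { id: state.id, lastActivityMs: nowMs };
}
```

Each activity resets the timer, so a session only expires after a continuous 30-minute gap.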

Session Scope

On MediaWiki, the session ID is per-domain. If a user views a page on domain A, clicks an interwiki link, and views a page on domain B, then they have two session IDs. Currently, we cannot link those two session IDs.

Additional info about sessions can be found at Analytics/Sessions.

App Install ID

An app install consists of one or more sessions. A new app install ID is generated when the user first opts into tracking, or when the user opts out of tracking and then opts back in.

Actions which occur within the scope of an app install – i.e. within the scope of multiple sessions – can be correlated by app install ID. These actions must have taken place on a single device, and it is safe to assume they were performed by a single user. An app install ID can be used as a proxy for an individual user. However, because an app install ID can be regenerated, it cannot be used as a proxy for an individual device.
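
To illustrate why an app install ID cannot stand in for a device, the regeneration rule can be sketched as a small state transition. TrackingState and setTrackingOptIn are hypothetical names, and discarding the ID on opt-out is an assumption made for this sketch; the text above only states that a new ID is generated on opting (back) in.

```typescript
// Hypothetical sketch; names and the opt-out behaviour are assumptions.
interface TrackingState {
  optedIn: boolean;
  appInstallId: string | null;
}

function setTrackingOptIn(
  state: TrackingState,
  optIn: boolean,
  generateId: () => string
): TrackingState {
  if (optIn && !state.optedIn) {
    // First opt-in, or opting back in after an opt-out: new ID.
    return { optedIn: true, appInstallId: generateId() };
  }
  if (!optIn) {
    // Assumption: the current ID is discarded on opt-out.
    return { optedIn: false, appInstallId: null };
  }
  // Already opted in: the ID is unchanged.
  return state;
}
```

The same device therefore yields a different ID after each opt-out and opt-in cycle.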

Analytics Enrolment Sampling

FIXME: Transfer documentation from Wikimedia Product/Analytics Infrastructure/Stream configuration (legacy)

The algorithm uses a "widening the net" approach: units determined to be in-sample at lower rates will be determined to be in-sample at higher rates.
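
One common way to obtain the "widening the net" property is to map each unit ID to a fixed position in [0, 1) and treat the unit as in-sample when that position falls below the sample rate. The sketch below assumes this approach; the FNV-1a hash and the names fnv1a32, unitPosition, and isInSample are illustrative choices, not the algorithm the platform actually uses.

```typescript
// Illustrative sketch; the hash function and all names are assumptions.

// 32-bit FNV-1a over UTF-16 code units (not bytes), chosen for brevity.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // as an unsigned 32-bit integer
}

// Maps a unit ID (e.g. a session ID) to a fixed position in [0, 1).
function unitPosition(unitId: string): number {
  return fnv1a32(unitId) / 0x100000000;
}

// A unit is in-sample when its position falls below the sample rate.
function isInSample(unitId: string, rate: number): boolean {
  return unitPosition(unitId) < rate;
}
```

Because a unit's position never changes, raising the rate can only add units: any unit with position below 0.1 is also below 0.5.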

Experiment Enrolment Sampling

Experiment enrolment sampling is the act of enrolling users into experiments and consistently assigning an enrolled user a variant of the feature that is being experimented on.

An experiment enrolment sampling algorithm, therefore, is a function, method, or process that accepts some inputs and returns a variant, i.e.

module Experiments {
    enrol( user: User, experiment: Experiment ): Variant;
}

Where:

  • user is, at minimum, a token that represents the user for at least the duration of the experiment; and
  • experiment is one or more constants that define the experiment, e.g. name, sample rate, and variants

Properties

Such an algorithm must:

  • Ensure consistency of assignment within an experiment, and independence of assignment across experiments. For example, if two experiments are running, each with two variants, then a given user should always be assigned the same variant for a given experiment, and across users the following combinations of assignments should be equally likely:

    Experiment 1    Experiment 2
    Variant 1       Variant 1
    Variant 1       Variant 2
    Variant 2       Variant 1
    Variant 2       Variant 2

    rather than only the correlated combinations:

    Experiment 1    Experiment 2
    Variant 1       Variant 1
    Variant 2       Variant 2

  • Be able to sample on a variety of levels, e.g. page, session, user, application install
  • Sample when needed, e.g. sample when a user visits a specific page and thereby not assign groups to users who never visit that page
  • Not require a backing store
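
A minimal sketch of an algorithm with these properties, assuming a hash-based approach, is shown below. The Experiment fields, the use of null for "not enrolled", the FNV-1a hash, and the name:token key format are all assumptions made for illustration; a production implementation would likely want a stronger hash.

```typescript
// Illustrative sketch; field names, the hash, and the key format are assumptions.
interface Experiment {
  name: string;
  sampleRate: number; // fraction of units to enrol, between 0 and 1
  variants: string[];
}

type Variant = string | null; // null means "not enrolled" in this sketch

// 32-bit FNV-1a over UTF-16 code units, chosen for brevity.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

function enrol(userToken: string, experiment: Experiment): Variant {
  // Mixing in the experiment name means the same token lands at a different
  // position in each experiment, so assignments are not tied together.
  const position = fnv1a32(experiment.name + ":" + userToken) / 0x100000000;
  if (position >= experiment.sampleRate) {
    return null; // not sampled into this experiment
  }
  // Rescale the position within the enrolled range to pick a variant.
  const count = experiment.variants.length;
  const index = Math.floor((position / experiment.sampleRate) * count);
  return experiment.variants[Math.min(index, count - 1)] ?? null;
}
```

The result is a pure function of the token and the experiment definition, so it can be computed on demand, at any sampling level, without a backing store. It also means that changing the name, sample rate, or variants mid-experiment can change a user's assignment.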

Caveats

In order for the first and last properties mentioned above to hold, any system using such an algorithm must lock or freeze the inputs to the algorithm for the duration of the experiment. However, it should be OK to extend the end date of an in-progress experiment.

Notes

  1. This section is wholly based on the discussion in T372108, "Document desired properties of an enrollment sampling algorithm".