Metrics Platform/Analytics sampling

From Wikitech

This page documents the options for data collection sampling supported by the Metrics Platform. Analytics sampling specifies how events are determined to be in-sample (sent) or out-of-sample (thrown away). You can configure sampling as part of event stream configuration or using the Experimentation Lab.

Analytics sampling controls data collection, not data generation.
  • Data generation: Instrument code determines when to submit events.
  • Data collection: Analytics sampling determines which events actually get sent to be processed and put into the database.
When instrumenting an A/B test, it is the responsibility of the instrumentation code to determine which clients get which feature variants; sampling logic cannot be used for experiment enrolment sampling.

Location

The location field allows you to set sampling logic per wiki. The default location applies to all wikis not specified in other rules. The wiki names used in the keys are database names, such as enwiki for English Wikipedia; see Configuration files.
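
As a rough illustration of the fallback behaviour, the sketch below resolves a sample rate for a wiki from a set of per-wiki rules. The dictionary layout, the rate values, and the names SAMPLING_RULES and rate_for_wiki are hypothetical and do not reflect the actual stream configuration schema. Only the "default" fallback and the use of database names as keys come from the description above.

```python
# Hypothetical per-wiki sampling rules. The layout and the values are
# illustrative assumptions, not the real stream configuration format.
SAMPLING_RULES = {
    "default": 0.01,  # applies to every wiki without its own rule
    "enwiki": 0.001,  # English Wikipedia (database name)
    "dewiki": 0.1,    # German Wikipedia (database name)
}


def rate_for_wiki(rules: dict[str, float], wiki_db_name: str) -> float:
    """Return the rule for this wiki, falling back to the default rule."""
    return rules.get(wiki_db_name, rules["default"])
```

With these example rules, enwiki and dewiki use their own rates and every other wiki, such as frwiki, falls back to the default.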

Sample rate

The sample rate is the proportion of identifiers that are considered in-sample:

  • 1.0 (100%) by default, can be overridden in individual streams
  • set to 0.0 to disable the stream (if you want to keep the stream in the config but prevent events from being sent to it)
  • uses a "widening the net" approach: IDs determined to be in-sample at lower rates will also be determined to be in-sample at higher rates

For example, suppose we have four streams, A, B, C, and D, with sample rates 0.01, 0.1, 0.25, and 0.5, respectively. Those streams could be using the same schema or different ones, but they all use the same identifier; let's say it's the session token. Remember, in the MEP paradigm streams map to tables inside the database. Here's what you should expect to see in those tables for any time period:

  • Table A will have data from approximately 1% of active sessions in that time period
  • Table B will have data from approximately 10% of active sessions in that time period, but definitely all of the sessions found in table A
  • Table C will have data from approximately 25% of active sessions in that time period, but definitely all of the sessions found in tables A and B
  • Table D will have data from approximately 50% of active sessions in that time period, but definitely all of the sessions found in tables A, B, and C
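
One way to get this nesting property is to map each identifier to a fixed position between 0 and 1 and to treat it as in-sample whenever that position is below the stream's sample rate. The sketch below is a minimal illustration under that assumption. The use of SHA-256 and the names sampling_position, is_in_sample, and streams_for are hypothetical and are not taken from the Metrics Platform clients, which may derive the position differently.

```python
import hashlib

# Sample rates from the example above.
STREAM_RATES = {"A": 0.01, "B": 0.1, "C": 0.25, "D": 0.5}


def sampling_position(identifier: str) -> float:
    """Map an identifier to a fixed position in [0, 1).

    Assumption: SHA-256 is used here only to spread identifiers evenly.
    """
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    # The first 4 bytes give an integer in [0, 2**32), so the result is < 1.
    return int.from_bytes(digest[:4], "big") / 2**32


def is_in_sample(identifier: str, rate: float) -> bool:
    """An identifier is in-sample when its position falls below the rate."""
    return sampling_position(identifier) < rate


def streams_for(identifier: str) -> list[str]:
    """Names of the example streams that would collect this identifier."""
    return [
        name
        for name, rate in STREAM_RATES.items()
        if is_in_sample(identifier, rate)
    ]
```

Because the position of an identifier never changes, raising the rate can only add identifiers to the sample. A rate of 0.0 selects nothing and a rate of 1.0 selects everything, which matches the behaviour described for disabling a stream and for the default rate.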

Sample unit

The sample unit is the segment that is used to determine which events are in-sample. You can choose to segment your sample by pageview, session, or device.
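
Put differently, the sample unit decides which identifier the in-sample decision is based on. The sketch below illustrates that choice. The ClientContext structure and the sampling_identifier function are hypothetical names introduced for illustration and are not part of any Metrics Platform client library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClientContext:
    """Identifiers a client might hold (hypothetical structure)."""

    pageview_id: str
    session_id: str
    app_install_id: Optional[str] = None  # only present in the mobile apps


def sampling_identifier(context: ClientContext, unit: str) -> str:
    """Return the identifier that the sampling decision is based on."""
    if unit == "pageview":
        return context.pageview_id
    if unit == "session":
        return context.session_id
    if unit == "device":
        if context.app_install_id is None:
            raise ValueError("the device unit needs an app install ID")
        return context.app_install_id
    raise ValueError(f"unknown sample unit: {unit!r}")
```

All events that share the chosen identifier receive the same decision, which is why the pageview unit yields a sample of pageviews while the session and device units keep whole sessions or whole app installs together.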

Pageview

Web-specific streams can be configured to use the "pageview" unit. This will cause the determination to be made on a page-by-page basis and can be useful for getting a random sample of pageviews rather than sessions.

Every pageview is an independent event with a unique ID. A new pageview ID is generated when the user:

  • Navigates to a page;
  • Refreshes the page;
  • Opens the page again in the same window or tab; or
  • Opens the page again in a different window or tab
FIXME: The pageview ID should be regenerated when navigating away from and then back to the page quickly.

Actions which occur within the scope of a single pageview can be correlated by pageview ID. These actions must have taken place on a single device and it is safe to assume they were performed by a single user. However, actions which occur across multiple pageviews cannot be correlated by pageview ID, and a pageview ID cannot be used as a proxy for an individual user.

Session

A browsing session consists of one or more pageviews on one domain. A new session ID is generated when:

  • The user first navigates to a page;
  • The user opens the page again in a private browsing window or tab; or
  • The session expires

Actions which occur within the scope of a session (that is, across multiple pageviews) can be correlated by session ID. These actions must have taken place on a single device and it is safe to assume they were performed by a single user. A session ID can be used as a proxy for an individual user.

Session expiry

Sessions can expire on the Wikipedias and in the iOS and Android apps. When a session expires, a new session ID is generated. The mechanism for session expiry on the Wikipedias is different from that in the apps:

  • On MediaWiki, a session expires if the user has not clicked, typed, or scrolled in the foreground window or tab for at least 30 minutes;
  • In the apps, a session expires if the user has not used the app for at least 30 minutes
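
The inactivity rule can be sketched as follows. This is an illustration only: the SessionTracker class, the use of secrets.token_hex for IDs, and the choice to check for expiry when the next activity arrives are assumptions, and the actual clients may implement expiry differently.

```python
import secrets

SESSION_TIMEOUT_SECONDS = 30 * 60  # 30 minutes of inactivity


class SessionTracker:
    """Keeps one session ID and replaces it after 30 idle minutes."""

    def __init__(self, now: float) -> None:
        self.session_id = secrets.token_hex(10)
        self.last_activity = now

    def record_activity(self, now: float) -> str:
        """Call on each click, keypress, or scroll.

        Returns the session ID that applies to this activity.
        """
        if now - self.last_activity >= SESSION_TIMEOUT_SECONDS:
            # The previous session expired, so start a new one.
            self.session_id = secrets.token_hex(10)
        self.last_activity = now
        return self.session_id
```

Timestamps are passed in as seconds so that the behaviour is easy to check: activity 29 minutes after the previous one keeps the session ID, while activity 30 or more minutes later produces a new one.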

Session scope

On MediaWiki, the session ID is per-domain. If a user views a page on domain A, clicks an interwiki link, and views a page on domain B, then they have two session IDs. Currently, we cannot link those two session IDs.

Additional info about sessions can be found at Analytics/Sessions.

Device

Mobile app-specific streams can be configured to use the "device" unit (app_install_id on iOS and Android). This will cause the determination to be made on a device-by-device basis. If a device is determined to be in-sample, all of its sessions and events will be in-sample. This is useful for retention metrics, cohort and longitudinal analyses, and cross-session analysis.

An app install consists of one or more sessions. A new app install ID is generated when the user first opts into tracking, or when the user opts out of tracking and then opts back in.

Actions which occur within the scope of an app install (that is, across multiple sessions) can be correlated by app install ID. These actions must have taken place on a single device and it is safe to assume they were performed by a single user. An app install ID can be used as a proxy for an individual user. However, because an app install ID can be regenerated, it cannot be used as a proxy for an individual device.