User:Triciaburmeister/Sandbox/Data platform/Analyze data
This page's contents have been moved to the mainspace at Data_Platform. See project history in phab:T350911.
This page outlines the tools and systems available for analyzing private Wikimedia data. For public data, see meta:Research:Data.
Key terms
- Analytics clients
- Analytics clients (also called "stat boxes") are servers in the production cluster that enable you to access private data and resources for statistical computation. They're called "analytics clients" because they act as clients accessing data from other databases.
- Analytics cluster
- The "analytics cluster" is a catch-all term for compute resources and services running inside of the Analytics VLAN, which itself is inside of WMF production network. Individual systems within the analytics cluster include Hadoop, and related components that run the Data Lake.
- Data Lake
- The Data Lake is a large, analytics-oriented repository of Wikimedia data.
- Hadoop
- A collection of services for batch processing of large data. See Hadoop.
- HDFS
- A file system for the Hadoop framework, which WMF uses to store files of various formats in the Data Lake.
- Hive
- A system that projects structure onto flat data (text or binary) in HDFS and allows this data to be queried using an SQL-like syntax.
Get access to internal data
Private data lives in the same server cluster that runs Wikimedia's production websites. This often means you need production access to work with it.
There are varying levels and combinations of access. The type of access you need depends on the tools you want to use and the type of data you need.
You must read and follow these guidelines in all your work with internal data at WMF.
Follow the process to file an access request for your account.
Query and analyze data
After you have access to internal data and systems, you can start exploring and querying data in the Data Lake.
Every analytics client provides a hosted Jupyter environment for interactive notebooks and terminals. WMF also uses a custom Conda distribution to manage packages and virtual environments on the analytics clients.
Follow the instructions at Data_Engineering/Systems/Jupyter to access Jupyter and to learn about the software packages available for data analysis.
DataHub is a data catalog that enables you to browse datasets in the Data Lake and view table schemas and other metadata. Access it at https://datahub.wikimedia.org.
The main way to access the data in the Data Lake is to run queries using one of the three available SQL engines: Presto, Hive, and Spark.
- Quickstart notebook
- Syntax differences between query engines
- Query examples
- Query and coding conventions
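As an illustration of how these engines are typically used from a notebook, here is a minimal sketch that assumes the run() helpers of the wmfdata-python package (listed among the tools below); check the quickstart notebook for the current interface.

```python
import wmfdata as wmf

# Run the same SQL through each engine; each call is assumed to return a
# pandas DataFrame, following wmfdata-python's documented run() pattern.
sql = "SELECT 1 AS sanity_check"

presto_result = wmf.presto.run(sql)  # Presto: low latency, good for interactive queries
hive_result = wmf.hive.run(sql)      # Hive: batch oriented, tolerant of long-running queries
spark_result = wmf.spark.run(sql)    # Spark: scalable, also usable through a full SparkSession

print(presto_result)
```

In practice you would pick one engine per task; the syntax differences page linked above explains where their SQL dialects diverge.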
For lightweight analysis tasks, consider using Superset, which has a graphical SQL editor where you can run Presto queries, or Hue, which has a graphical SQL editor where you can run Hive queries.
Other tools and libraries that can help with analysis include:
- wmfdata-python and wmfdata-r (available in Jupyter environments on analytics clients)
- wmfastr: for speedy dwell-time and search preference metric calculations in R
- waxer: R wrapper for the metrics endpoint of the AQS REST API
- MediaWiki-utilities, including tools for parsing HTML and wikitext (a parsing sketch follows this list)
- Tools for working with the Wikimedia dumps
- Resources for IP geolocation and geotagging
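For example, wikitext pulled from the dumps or from the Data Lake can be parsed with a library such as mwparserfromhell, used here purely for illustration; it is one common Python choice, not necessarily the specific tool the list above refers to.

```python
import mwparserfromhell

# Parse a small piece of wikitext and pull out links, templates, and plain text.
wikitext = "'''Example''' article with a [[Main Page|link]] and {{Infobox|name=Example}}."
parsed = mwparserfromhell.parse(wikitext)

links = [str(link.title) for link in parsed.filter_wikilinks()]
templates = [str(tpl.name) for tpl in parsed.filter_templates()]
plain_text = parsed.strip_code()

print(links)       # ['Main Page']
print(templates)   # ['Infobox']
print(plain_text)  # 'Example article with a link and .'
```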
Use internal versions of public resources
You can access some popular public data sources more quickly and efficiently by using these internal data platform tools or datasets.
For a full overview of the types of data available internally and publicly, see Discover data.
Public pageviews data is available through dumps, APIs, and dashboards, but you can access more granular data internally in the wmf.pageview_hourly Hive table.
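A minimal sketch of querying that table, assuming wmfdata-python and the commonly documented columns (project, page_title, view_count) and year/month/day/hour partitions; verify the schema in DataHub before running.

```python
import wmfdata as wmf

# Hourly views for one article on one day; column and partition names are
# assumptions based on the table's documented schema.
pageviews = wmf.presto.run("""
    SELECT hour, SUM(view_count) AS views
    FROM wmf.pageview_hourly
    WHERE project = 'en.wikipedia'
      AND page_title = 'Example'
      AND year = 2024 AND month = 1 AND day = 15
    GROUP BY hour
    ORDER BY hour
""")
```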
The wmf database contains internal versions of the public data dumps published at dumps.wikimedia.org. The internal tables include raw and preprocessed edits data. For example, wmf.mediawiki_wikitext_history provides an internal version of the public XML dumps, refined into Avro data.
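As a sketch, the wikitext history table can be queried with Spark, which handles its large size better than the interactive engines; the snapshot value and column names (wiki_db, page_title, revision_text) are assumptions to confirm in DataHub.

```python
import wmfdata as wmf

# Text length of revisions on a small wiki, from one monthly snapshot.
# The snapshot partition value and column names are illustrative assumptions.
revisions = wmf.spark.run("""
    SELECT page_title, LENGTH(revision_text) AS text_length
    FROM wmf.mediawiki_wikitext_history
    WHERE snapshot = '2024-01'
      AND wiki_db = 'simplewiki'
    LIMIT 10
""")
```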
Internal users can access EventLogging datasets stored in the event and event_sanitized Hive databases, instead of using the public Event Streams service.
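An event table can be queried like any other Hive table; the table name event.navigationtiming and its partition columns are used here only as an illustration, so browse DataHub for the actual schemas.

```python
import wmfdata as wmf

# Daily event counts for one month from an illustrative event table;
# table and partition names are assumptions to verify in DataHub.
daily_events = wmf.presto.run("""
    SELECT day, COUNT(*) AS n_events
    FROM event.navigationtiming
    WHERE year = 2024 AND month = 1
    GROUP BY day
    ORDER BY day
""")
```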
Query the MediaWiki APIs internally in R and Python, rather than sending requests over the internet.
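In Python, one common client is the mwapi package; the sketch below uses the public en.wikipedia.org host only as a placeholder, since the internal endpoint and routing details are described in the linked documentation.

```python
import mwapi

# Basic Action API request; replace the placeholder host with the internal
# endpoint described in the documentation when running on an analytics client.
session = mwapi.Session(
    host="https://en.wikipedia.org",  # placeholder host
    user_agent="data-platform analysis example <your-email@wikimedia.org>",
)
response = session.get(action="query", prop="info", titles="Example")
print(response["query"]["pages"])
```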
Next steps
To learn about how to publish and share your analyses through dashboards, visualizations, and more, see Publish data.