Data Platform/Analyze data

From Wikitech


This page outlines the tools and systems available for analyzing private Wikimedia data. For public data, see meta:Research:Data.

Key terms

This term list focuses only on what you should know to get started using the Data Platform for analysis. A more comprehensive glossary is at Data_Platform/Systems/Cluster#Glossary.
Analytics clients
Analytics clients (also called "stat boxes") are servers in the production cluster that enable you to access private data and resources for statistical computation. They're called "analytics clients" because they act as clients accessing data from other databases.
Analytics cluster
The "analytics cluster" is a catch-all term for compute resources and services running inside the Analytics VLAN, which is itself inside the WMF production network. Individual systems within the analytics cluster include Hadoop and the related components that run the Data Lake.
Data Lake
The Data Lake is a large, analytics-oriented repository of Wikimedia data.
Hadoop
A collection of services for batch processing of large datasets. See Hadoop.
HDFS
A file system for the Hadoop framework, which WMF uses to store files of various formats in the Data Lake.
Hive
A system that projects structure onto flat data (text or binary) in HDFS and allows this data to be queried using an SQL-like syntax.


Get access to internal data

Private data lives in the same server cluster that runs Wikimedia's production websites. This means you often need production access to reach it.

There are varying levels and combinations of access. The type of access you need depends on the tools you want to use, and the type of data you need to access.

You must read and follow these guidelines in all your work with internal data at WMF.

Follow the process to file an access request for your account.

Query and analyze data

After you have access to internal data and systems, you can start exploring and querying data in the Data Lake.

Every analytics client provides a hosted Jupyter environment for interactive notebooks and terminals. WMF also uses a custom Conda distribution to manage packages and virtual environments on the analytics clients.

Browse datasets in the Data Lake and view table schemas and other metadata: https://datahub.wikimedia.org.

Run SQL queries

The main way to access the data in the Data Lake is to run queries using one of the three available SQL engines: Presto, Hive, and Spark.

For lightweight analysis tasks, use Superset, which has a graphical SQL editor where you can run Presto queries, or Hue, which has a graphical SQL editor where you can run Hive queries.
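
From a notebook, choosing between the three engines can be treated as routing the same SQL string to a different backend. The sketch below illustrates that idea only: `run_query`, the `runners` mapping, and `fake_runner` are hypothetical names made up for this example, not part of any WMF library. Replace the stand-in runner with whichever client your environment provides for each engine.

```python
from typing import Callable, Dict, List

# Hypothetical runner type: takes a SQL string and returns result rows.
Runner = Callable[[str], List[tuple]]

# The three SQL engines named on this page.
ENGINES = ("presto", "hive", "spark")


def run_query(sql: str, engine: str, runners: Dict[str, Runner]) -> List[tuple]:
    """Send `sql` to the runner registered for `engine`."""
    if engine not in ENGINES:
        raise ValueError(f"unknown engine {engine!r}; expected one of {ENGINES}")
    if engine not in runners:
        raise KeyError(f"no runner configured for {engine!r}")
    return runners[engine](sql)


def fake_runner(sql: str) -> List[tuple]:
    """Stand-in for demonstration only; it does not contact any cluster."""
    return [(sql,)]


rows = run_query("SELECT 1", "presto", {"presto": fake_runner})
print(rows)
```

Keeping the engine choice as a parameter makes it easy to prototype a query interactively on one engine and rerun the same SQL on another for a larger job, provided the query sticks to syntax that both engines accept.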

Use libraries and analysis packages

Documentation pages for specific data sources may also contain example queries for working with that dataset. For example: wmf.webrequest Sample queries.

Use internal versions of public resources

You can access some popular public data sources more quickly and efficiently by using these internal data platform tools or datasets.

For a full overview of the types of data available internally and publicly, see Discover data.

Public pageviews data is available through dumps, APIs, and dashboards, but you can access more granular data internally in the wmf.pageview_hourly Hive table.
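
As a minimal sketch of what a query against this table might look like, the helper below builds a SQL string for the most-viewed pages of one project on one day. The column names (`project`, `page_title`, `view_count`, `agent_type`) and the `year`/`month`/`day` partition columns are assumptions not stated on this page; check the table schema in DataHub before running anything. The function name `pageviews_query` is hypothetical.

```python
from datetime import date


def pageviews_query(project: str, day: date, limit: int = 10) -> str:
    """Build a query for the most-viewed pages of one project on one day.

    Column and partition names are assumptions; verify them against the
    actual schema of wmf.pageview_hourly.
    """
    # Reject anything that is not a plain project name such as "en.wikipedia",
    # since the value is interpolated into the SQL text.
    if not project.replace(".", "").replace("-", "").isalnum():
        raise ValueError(f"unexpected project name: {project!r}")
    return (
        "SELECT page_title, SUM(view_count) AS views\n"
        "FROM wmf.pageview_hourly\n"
        # Filtering on the (assumed) partition columns limits the data scanned.
        f"WHERE year = {day.year} AND month = {day.month} AND day = {day.day}\n"
        f"  AND project = '{project}'\n"
        "  AND agent_type = 'user'\n"
        "GROUP BY page_title\n"
        "ORDER BY views DESC\n"
        f"LIMIT {int(limit)}"
    )


print(pageviews_query("en.wikipedia", date(2024, 1, 15)))
```

Restricting the query to a single day keeps it small enough for interactive exploration; widen the date range only once the query returns what you expect.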

The wmf database contains internal versions of the public data dumps published at dumps.wikimedia.org. The internal tables include raw and preprocessed edits data. For example, wmf.mediawiki_wikitext_history provides an internal version of the public XML dumps, refined into Avro data.

Internal users can access EventLogging datasets stored in the event and event_sanitized Hive databases, instead of using the public Event Streams service.

Internal MediaWiki API requests

Query the MediaWiki APIs internally in R and Python, rather than sending requests over the internet.
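
The following sketch shows one way such a request could be assembled in Python using only the standard library. It builds the request object without sending it. The endpoint `https://mw-api.internal.example` is a hypothetical placeholder, and selecting the target wiki through the `Host` header is an assumption about how an internal endpoint might be addressed; consult the linked documentation for the real endpoint and conventions. The function name `build_api_request` is also made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical placeholder; the real internal endpoint is not given on this page.
INTERNAL_ENDPOINT = "https://mw-api.internal.example"


def build_api_request(wiki_domain: str, params: dict) -> Request:
    """Build (but do not send) an Action API request for `wiki_domain`.

    Assumption: the internal endpoint picks the wiki from the Host header.
    """
    # Always ask for JSON so the response is easy to parse.
    query = urlencode({**params, "format": "json"})
    req = Request(f"{INTERNAL_ENDPOINT}/w/api.php?{query}")
    req.add_header("Host", wiki_domain)
    # Identify the client; the value here is only an example.
    req.add_header("User-Agent", "analysis-notebook-example/0.1")
    return req


req = build_api_request("en.wikipedia.org", {"action": "query", "meta": "siteinfo"})
print(req.full_url)
print(req.get_header("Host"))
```

To send the request, you would pass `req` to `urllib.request.urlopen` from a host that can reach the real endpoint.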

Next steps

To learn about how to publish and share your analyses through dashboards, visualizations, and more, see Transform and publish data.