User:Triciaburmeister/Sandbox/Data platform/Analyze data
This page's contents have been moved to the mainspace at Data_Platform. See project history in phab:T350911.
This page outlines the tools and systems available for analyzing private Wikimedia data. For public data, see meta:Research:Data.
Key terms
- Analytics clients
- Analytics clients (also called "stat boxes") are servers in the production cluster that enable you to access private data and resources for statistical computation. They're called "analytics clients" because they act as clients accessing data from other databases.
- Analytics cluster
- The "analytics cluster" is a catch-all term for compute resources and services running inside of the Analytics VLAN, which itself is inside of WMF production network. Individual systems within the analytics cluster include Hadoop, and related components that run the Data Lake.
- Data Lake
- The Data Lake is a large, analytics-oriented repository of Wikimedia data.
- Hadoop
- A collection of services for batch processing of large data. See Hadoop.
- HDFS
- A file system for the Hadoop framework, which WMF uses to store files of various formats in the Data Lake.
- Hive
- A system that projects structure onto flat data (text or binary) in HDFS and allows this data to be queried using an SQL-like syntax.
Get access to internal data
Private data lives in the same server cluster that runs Wikimedia's production websites. This often means you need production access to work with it.
There are varying levels and combinations of access. The type of access you need depends on the tools you want to use and the type of data you need.
You must read and follow these guidelines in all your work with internal data at WMF.
Follow the process to file an access request for your account.
Query and analyze data
After you have access to internal data and systems, you can start exploring and querying data in the Data Lake.
Every analytics client provides a hosted Jupyter environment for interactive notebooks and terminals. WMF also uses a custom Conda distribution to manage packages and virtual environments on the analytics clients.
Follow the instructions at Data_Engineering/Systems/Jupyter to access Jupyter and to learn about the software packages available for data analysis.
DataHub is a data catalog that enables you to browse datasets in the Data Lake and view table schemas and other metadata. Access it at https://datahub.wikimedia.org.
The main way to access the data in the Data Lake is to run queries using one of the three available SQL engines: Presto, Hive, and Spark.
- Quickstart notebook
- Syntax differences between query engines
- Query examples
- Query and coding conventions
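As an illustration of how these engines are typically used from a notebook, here is a minimal sketch that assumes the run() helpers of the wmfdata-python package (listed among the tools below); check the quickstart notebook for the current interface.

```python
import wmfdata as wmf

# Run the same SQL through each engine; each call is assumed to return a
# pandas DataFrame, following wmfdata-python's documented run() pattern.
sql = "SELECT 1 AS sanity_check"

presto_result = wmf.presto.run(sql)  # Presto: low latency, good for interactive queries
hive_result = wmf.hive.run(sql)      # Hive: batch oriented, tolerant of long-running queries
spark_result = wmf.spark.run(sql)    # Spark: scalable, also usable through a full SparkSession

print(presto_result)
```

In practice you would pick one engine per task; the syntax differences page linked above explains where their SQL dialects diverge.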
For lightweight analysis tasks, consider using Superset, which has a graphical SQL editor where you can run Presto queries, or Hue, which has a graphical SQL editor where you can run Hive queries.
Other tools and libraries that can help with analysis include:
- wmfdata-python and wmfdata-r (available in Jupyter environments on analytics clients)
- wmfastr: for speedy dwell-time and search preference metric calculations in R
- waxer: R wrapper for the metrics endpoint of the AQS REST API
- MediaWiki-utilities, including tools for parsing HTML and wikitext (a parsing sketch follows this list)
- Tools for working with the Wikimedia dumps
- Resources for IP geolocation and geotagging
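For example, wikitext pulled from the dumps or from the Data Lake can be parsed with a library such as mwparserfromhell, used here purely for illustration; it is one common Python choice, not necessarily the specific tool the list above refers to.

```python
import mwparserfromhell

# Parse a small piece of wikitext and pull out links, templates, and plain text.
wikitext = "'''Example''' article with a [[Main Page|link]] and {{Infobox|name=Example}}."
parsed = mwparserfromhell.parse(wikitext)

links = [str(link.title) for link in parsed.filter_wikilinks()]
templates = [str(tpl.name) for tpl in parsed.filter_templates()]
plain_text = parsed.strip_code()

print(links)       # ['Main Page']
print(templates)   # ['Infobox']
print(plain_text)  # 'Example article with a link and .'
```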
Use internal versions of public resources
You can access some popular public data sources more quickly and efficiently by using these internal data platform tools or datasets.
For a full overview of the types of data available internally and publicly, see Discover data.
Public pageviews data is available through dumps, APIs, and dashboards, but you can access more granular data internally in the wmf.pageview_hourly Hive table.
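A minimal sketch of querying that table, assuming wmfdata-python and the commonly documented columns (project, page_title, view_count) and year/month/day/hour partitions; verify the schema in DataHub before running.

```python
import wmfdata as wmf

# Hourly views for one article on one day; column and partition names are
# assumptions based on the table's documented schema.
pageviews = wmf.presto.run("""
    SELECT hour, SUM(view_count) AS views
    FROM wmf.pageview_hourly
    WHERE project = 'en.wikipedia'
      AND page_title = 'Example'
      AND year = 2024 AND month = 1 AND day = 15
    GROUP BY hour
    ORDER BY hour
""")
```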
The wmf database contains internal versions of the public data dumps published at dumps.wikimedia.org. The internal tables include raw and preprocessed edits data. For example, wmf.mediawiki_wikitext_history provides an internal version of the public XML dumps, refined into Avro data.
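As a sketch, the wikitext history table can be queried with Spark, which handles its large size better than the interactive engines; the snapshot value and column names (wiki_db, page_title, revision_text) are assumptions to confirm in DataHub.

```python
import wmfdata as wmf

# Text length of revisions on a small wiki, from one monthly snapshot.
# The snapshot partition value and column names are illustrative assumptions.
revisions = wmf.spark.run("""
    SELECT page_title, LENGTH(revision_text) AS text_length
    FROM wmf.mediawiki_wikitext_history
    WHERE snapshot = '2024-01'
      AND wiki_db = 'simplewiki'
    LIMIT 10
""")
```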
Internal users can access EventLogging datasets stored in the event and event_sanitized Hive databases, instead of using the public Event Streams service.
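An event table can be queried like any other Hive table; the table name event.navigationtiming and its partition columns are used here only as an illustration, so browse DataHub for the actual schemas.

```python
import wmfdata as wmf

# Daily event counts for one month from an illustrative event table;
# table and partition names are assumptions to verify in DataHub.
daily_events = wmf.presto.run("""
    SELECT day, COUNT(*) AS n_events
    FROM event.navigationtiming
    WHERE year = 2024 AND month = 1
    GROUP BY day
    ORDER BY day
""")
```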
Query the MediaWiki APIs internally in R and Python, rather than sending requests over the internet.
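In Python, one common client is the mwapi package; the sketch below uses the public en.wikipedia.org host only as a placeholder, since the internal endpoint and routing details are described in the linked documentation.

```python
import mwapi

# Basic Action API request; replace the placeholder host with the internal
# endpoint described in the documentation when running on an analytics client.
session = mwapi.Session(
    host="https://en.wikipedia.org",  # placeholder host
    user_agent="data-platform analysis example <your-email@wikimedia.org>",
)
response = session.get(action="query", prop="info", titles="Example")
print(response["query"]["pages"])
```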
Next steps
To learn about how to publish and share your analyses through dashboards, visualizations, and more, see Publish data.