Analytics
Analytics is the systematic computational analysis of data or statistics, for the purposes of discovery, interpretation, and communication of meaningful patterns.
In the context of the Wikimedia Foundation, the term Analytics generally refers to work carried out on the Analytics Cluster and the Data Lake by various WMF staff and volunteers.
The Data Platform Engineering team is now responsible for managing the Analytics Cluster and the Data Lake, so most pages under /Analytics are of historical interest only.
Analytics Cluster
The Analytics Cluster comprises a number of systems that help researchers, data scientists, machine learning engineers, and other authorized parties access the Data Lake.
If you believe that you need access to the cluster, please refer to Data Platform/Data access.
Data Lake
The term Data Lake refers to the set of data files (also referred to as datasets) that are stored in HDFS, the Hadoop Distributed File System.
Many of these datasets are managed by the Data Platform Engineering team through pipelines that are deployed to production and monitored.
However, members of the analytics-privatedata-users group may also create their own data files in Hadoop, which lets them define custom Hive tables and work with the data from tools such as Jupyter and Spark.
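As a rough illustration of what defining a custom Hive table over your own files can look like, the sketch below builds a HiveQL `CREATE EXTERNAL TABLE` statement in Python. The database name, table name, columns, and HDFS path are all hypothetical placeholders, and the helper `build_create_table` is invented for this example; it is not a documented Wikimedia procedure.

```python
def build_create_table(database, table, columns, location):
    """Return a HiveQL statement defining an external Parquet table.

    All arguments are illustrative; substitute your own database,
    table name, column list, and HDFS directory.
    """
    # Render each (name, type) pair as one column definition.
    column_sql = ",\n  ".join(
        f"`{name}` {hive_type}" for name, hive_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"  {column_sql}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Hypothetical names, chosen only for the example.
statement = build_create_table(
    database="example_user_db",
    table="daily_edit_counts",
    columns=[("wiki", "STRING"), ("day", "DATE"), ("edits", "BIGINT")],
    location="/user/example_user/daily_edit_counts",
)
print(statement)
```

The resulting string could then be run from a Hive session or passed to a Spark session's `sql` method, for instance from a Jupyter notebook, assuming your account is allowed to write to the chosen database and path.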
See also
- Research and Decision Science (https://meta.wikimedia.org/wiki/Research_and_Decision_Science), which includes the Movement Insights and Product Analytics teams.
- The Product Analytics style guide has code conventions for SQL, Python, and R. Consider adopting them to make your work more consistent and easier for others in the movement to read!