Commons Impact Metrics

The Commons Impact Metrics data product is a collection of datasets designed to provide insight into the impact of community contributions to Wikimedia Commons. So far, the data is focused on media files uploaded by, and categories belonging to, GLAM actors (affiliates, projects, individual contributors, etc. related to galleries, libraries, archives, and museums). This page describes the project, the main properties of the data pipeline, and how to access the data and code. If you're looking for developer documentation on the shape of the data and how to query it, see the data model docs.

Project rationale

There has been a long-standing community request for a data product that would give insight into the impact of Commons contributions[1]. While the WMF has not been able to address the request, the community has created a list of tools that compute such data and serve it via visual web applications: tools such as GLAM Wiki Dashboard, BaGLAMa2 and GLAMorgan. In the years before this project, the community reported difficulties maintaining these tools for several reasons, and the tools have become less useful to the community due to data outages, data inconsistency between tools, and the complexity of the calculations. This project aims to improve on those issues by delivering a data product that:

  • Answers most of the use cases covered by the mentioned tools.
  • Is robust, not subject to data outages.
  • Is standardized and can be used consistently across a range of tools.
  • Provides pre-calculated data that is easy to query and manage.

Properties and caveats

Category allow-list

Because of computational complexity, data size, and dataset semantics, we have scoped this data product to report only on a curated list of GLAM primary categories. Each of those categories belongs to a GLAM institution, event, contributor, project, etc. The data product also reports on all sub-categories under those listed primary categories, and on the media files directly associated with them. The initial allow-list was put together from the existing tools mentioned above, but it is open to additions. See the current Commons Impact Metrics allow-list on GitLab.

Allow-list updates

The Commons Impact Metrics category allow-list is open to update requests (addition of new categories and renaming or removal of existing ones). You can request an update to the allow-list here (guidelines for the process here).

New categories should correspond to the primary (top) category of either:

  • A Commons mass contributor actor/entity. For instance, the category of a specific museum, library, individual mass contributor, etc. (“Media_contributed_by_someone” or “National_Museum_of_someplace”).
  • An event or a project aimed at generating Commons mass contribution. For instance, an editathon organized to generate mass contribution. (“Wiki Loves Something”, “Images_uploaded_as_part_of_some_collaboration”).

The category should not refer to other things, such as media locations, media subjects, media formats, or tools used to upload media (“Modern_art”, “Wales”, “Uploaded_with_some_tool”); this is especially important if the category is vague or overarching, like “Images about art”. Categories like those could quickly compromise the performance of the data pipeline and make the dumps unusable by the community. Exceptions can be made on a case-by-case basis.

Category renames

If an allow-listed category is renamed in Commons, the Commons Impact Metrics pipeline will cease to calculate metrics for it on its next monthly run. To prevent that, the allow-list has to be updated by replacing the old name with the new name before the end of the month. This can be done using the allow-list update process above. Note that even when an allow-listed category is properly renamed, the data collected before the rename will still be associated with the old name, while the data collected after the rename will be associated with the new name.

No retroactive calculations by default

When new categories are added to the allow-list, the pipeline will calculate metrics for them from the time of the addition onward. By default, there are no retroactive data re-runs or back-fills for new categories. Because the calculation needs to happen for all categories at once (not just the new ones), ad-hoc re-runs or back-fills are expensive and impractical in terms of computation and engineering resources. If necessary, it would be possible to have general re-runs every 6 months, which would back-fill data for the last 6 months.

Max depth

In Commons' category graph, most sub-graphs are interconnected. You can navigate from a sub-graph about a given museum in a given country and end up in a sub-graph about a project on the other side of the world. In practice, if the allow-list mentioned above is big enough, navigating the listed sub-graphs without limits might end up traversing the whole of Commons' category graph. Because we want to report on GLAM-specific sub-graphs, we impose a limit on how deep an allow-listed category tree is considered. Learn more about why in this deep dive on the algorithm. Currently the max depth is 7, meaning this data product only reports on sub-categories that are at most 7 steps away from the allow-listed primary category. The data also reports on all media files directly associated with any of those categories and sub-categories.
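To illustrate the idea, here is a minimal sketch of a depth-limited breadth-first traversal over a category graph. The data structure and names are illustrative assumptions (the real pipeline computes the graph with a Spark-Scala module, not like this):

```python
from collections import deque

MAX_DEPTH = 7  # current max depth used by Commons Impact Metrics


def allowed_subcategories(graph, primary_category, max_depth=MAX_DEPTH):
    """Collect all sub-categories within max_depth steps of a primary category.

    `graph` is an illustrative dict mapping a category name to the list of
    its direct sub-categories.
    """
    visited = {primary_category: 0}
    queue = deque([primary_category])
    while queue:
        category = queue.popleft()
        depth = visited[category]
        if depth == max_depth:
            continue  # do not traverse deeper than max_depth
        for subcategory in graph.get(category, []):
            if subcategory not in visited:  # the category graph can contain cycles
                visited[subcategory] = depth + 1
                queue.append(subcategory)
    return visited  # category -> distance from the primary category
```

Without the depth cap (and the cycle check), a traversal like this could wander from one allow-listed sub-graph into the rest of Commons' interconnected category graph.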

Aggregated and released monthly

For data size reasons, we currently aggregate the data at a monthly granularity. One of the design criteria of this product is that it should be manageable for community members, who usually do not have access to a cluster to run queries over hundreds of gigabytes of data; reducing the granularity to monthly makes the data lighter and more manageable. In addition, because this dataset depends on source data that we currently only ingest at a monthly pace, we can only offer a monthly release schedule.

Pageviews vs. Mediarequests

The previously existing tools developed by the community used two different base metrics: mediarequests and pageviews. In an effort to unify on a single metric, the Data Products team analyzed the pros and cons of each. The main ones are listed below:

Mediarequests

  Pros:
  • No monthly drift. Mediarequests are associated directly with media files, so they are not subject to monthly drift.

  Cons:
  • Not associated with wiki pages. Mediarequests can not be filtered or broken down per wiki page.
  • Less bot filtering. The Mediarequests pipeline filters out self-identified bots, but not other automated traffic.

Pageviews

  Pros:
  • Associated with wiki pages. Pageviews are associated directly with wiki pages, so they can be filtered and broken down per wiki page.
  • Better automated traffic detection. Since Pageviews is a core pipeline in WMF's Data Engineering Platform, it benefits from the automated traffic detection pipeline.

  Cons:
  • Monthly drift. Pageviews are associated directly with wiki pages, which causes monthly drift. See the corresponding section for more details.

Data Products chose Pageviews as the base metric for the dataset, based on the combined evaluation of all pros and cons. However, at several points during the project, some community members noted that the monthly drift problem was a significant drawback. Data Products agrees, and plans to mitigate the monthly drift in the future. Note: Mediarequests (outside the context of Commons Impact Metrics) are already publicly available in the form of dumps and an API (AQS).

Monthly drift

The base metric used for the Commons Impact Metrics data product is currently Pageviews. More specifically, Pageviews to wiki pages containing media files categorized under an allow-listed category tree.

The problem arises when a media file belonging to an allow-listed category is added to a wiki page. The only way of knowing the exact date of addition is parsing the wikitext history for media file updates; this is possible in theory, but would be very difficult and would require a long time and a big engineering effort. Another way to approximate the date of addition is to query MediaWiki's imagelinks table, and that is what the pipeline does. However, we can only do this at a monthly pace, since the current MediaWiki database imports to Data Engineering's data lake only happen monthly. As a result, we only know the month a media file was added to a wiki page (not the day or hour), so we can only calculate Pageview aggregations for the full month. Even when a media file is added mid-month (for example, on the 15th), the pipeline will aggregate Pageviews for the corresponding wiki page since the 1st of the month, thus overcounting the Pageviews from the 1st to the 14th.
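As a rough sketch of why the overcount happens, assume we only know the month of addition (via the monthly imagelinks import) but have daily pageview counts for the page. All names and numbers here are hypothetical, for illustration only:

```python
import datetime


def monthly_pageviews(daily_views, month_start, addition_date):
    """Aggregate pageviews for the month a media file was added to a page.

    Because the pipeline only knows the *month* of addition, it sums from
    the 1st, even though only the views from `addition_date` onward actually
    involve the media file.
    """
    counted = sum(views for day, views in daily_views.items() if day >= month_start)
    actual = sum(views for day, views in daily_views.items() if day >= addition_date)
    return counted, counted - actual  # (reported total, overcount a.k.a. drift)


# Hypothetical example: 10 views/day in June, file added on June 15th.
june = {datetime.date(2024, 6, d): 10 for d in range(1, 31)}
reported, drift = monthly_pageviews(
    june, datetime.date(2024, 6, 1), datetime.date(2024, 6, 15)
)
# reported == 300, drift == 140 (the views from June 1-14 are overcounted)
```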

Note that the monthly drift only happens in the month a media file is added to a wiki page. It doesn't happen in the subsequent months, because the media file is then present from the 1st to the last day of the month, and all Pageviews are rightfully counted. The drift also does not change past aggregations for that media file or its associated categories (a known issue of previous tools); it only affects the counts moving forward.

Potential mitigations of the monthly drift include:

  • Adding Mediarequests to the data product
  • Increasing the granularity and release schedule of the data product to daily

How to access the data

The Commons Impact Metrics data pipeline populates different datastores where the data can be queried:

  • the Data Engineering team's data lake;
  • the Analytics Query Service API (AQS); and
  • the Commons Impact Metrics dumps.

At one point in the pipeline, the data is also ingested into Cassandra, but that is just for internal AQS consumption, not for user queries.

Data lake

Data Engineering's data lake stores the base datasets of the Commons Impact Metrics product. They are the basis from which the dumps are formatted, and from which the Cassandra tables that serve AQS are populated. That said, they can also be queried directly by people who have access to the data lake (i.e. who have WMF Kerberos credentials). You don't need further "analytics-privatedata-users" permissions to access this data, since it's not private. There are 5 base datasets for Commons Impact Metrics (read more about the data model; see also the example query after the table below):

Hive database     Iceberg table                              HDFS location                                                         Description
wmf_contributors  commons_category_metrics_snapshot          /wmf/data/wmf_contributors/commons/category_metrics_snapshot         Metrics about CIM categories.
wmf_contributors  commons_media_file_metrics_snapshot        /wmf/data/wmf_contributors/commons/media_file_metrics_snapshot       Metrics about CIM media files.
wmf_contributors  commons_pageviews_per_category_monthly     /wmf/data/wmf_contributors/commons/pageviews_per_category_monthly    Aggregated pageview counts for CIM categories.
wmf_contributors  commons_pageviews_per_media_file_monthly   /wmf/data/wmf_contributors/commons/pageviews_per_media_file_monthly  Aggregated pageview counts for CIM media files.
wmf_contributors  commons_edits                              /wmf/data/wmf_contributors/commons/edits                             CIM edit events.
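For example, from an environment with data lake access, one of these tables can be queried with PySpark. This is a minimal sketch: the `month` partition column and its value format are assumptions here, so check the data model docs for the exact schema:

```python
# Minimal sketch: query a Commons Impact Metrics base table from a PySpark
# session (requires WMF Kerberos credentials and Hive support).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cim-example").enableHiveSupport().getOrCreate()

categories = spark.sql("""
    SELECT *
    FROM wmf_contributors.commons_category_metrics_snapshot
    WHERE month = '2024-06'  -- hypothetical partition column and value
    LIMIT 10
""")
categories.show(truncate=False)
```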

Dumps

The Commons Impact Metrics dumps consist of 5 public datasets updated on a monthly schedule. They follow exactly the same data model as the data lake datasets above; you can find the data model details here. Anyone can download them from https://dumps.wikimedia.org/other/commons_impact_metrics/readme.html. Take into account that:

  • They are formatted in TSV (tab separated values).
  • They are compressed using Bzip2.
  • Some fields contain lists of strings; in that case, the strings are separated by | (pipe) symbols. See the parsing sketch after this list.
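Given those three properties, a minimal Python sketch for reading a dump file could look like this. The file name and the pipe-separated column are hypothetical; the actual file names and columns are listed in the dumps README and the data model docs:

```python
import bz2
import csv

# Hypothetical file name; actual names are listed in the dumps README.
DUMP_FILE = "commons_category_metrics_snapshot_2024-06.tsv.bz2"

with bz2.open(DUMP_FILE, mode="rt", encoding="utf-8", newline="") as f:
    reader = csv.reader(f, delimiter="\t")  # TSV: tab-separated values
    header = next(reader)  # assuming a header row; check the README
    for row in reader:
        record = dict(zip(header, row))
        # Fields holding lists of strings are pipe-separated, e.g.
        # (hypothetical column name):
        # parents = record["parent_categories"].split("|")
        print(record)
        break  # just show the first record
```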

API

The Commons Impact Metrics data is also served publicly (without authentication) via the Analytics Query Service API (AQS). The service has 14 endpoints that you can query with different parameters. To use the API, see the Analytics API documentation.
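As an illustration, here is a request against the API using Python's requests library. The endpoint path and parameters shown are assumptions, not a documented endpoint; the 14 real endpoints and their parameters are listed in the Analytics API documentation:

```python
import requests

# Hypothetical endpoint path and parameters, for illustration only.
url = (
    "https://wikimedia.org/api/rest_v1/metrics/commons-analytics/"
    "category-metrics-snapshot/Example_Category/20240601/20240701"
)
# Per API etiquette, identify your client with a User-Agent header.
headers = {"User-Agent": "cim-example (someone@example.org)"}

response = requests.get(url, headers=headers)
response.raise_for_status()
print(response.json())
```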

Pipeline architecture

The Commons Impact Metrics data pipeline consists of 4 pieces of software:

  • the transformation of the source data into the base datasets;
  • the generation of the dumps;
  • the transformation and loading of the base data into Cassandra; and
  • the AQS service (which consumes the data in Cassandra).

Base datasets

There is an Airflow Directed Acyclic Graph (DAG) that waits for the source data to be present, processes it to generate the 5 base datasets, and stores them in Hive (Iceberg) tables. It uses a Spark-Scala module to put the Commons category graph together, and a set of SparkSQL queries (excluding the ones starting with dump_) to compute the final data on top of the category graph. The DAG executes on a monthly schedule. The source data includes a couple of MediaWiki tables (imported monthly to the data lake via Gobblin), namely page, image, imagelinks, categorylinks, etc.; it also includes wmf.pageview_hourly and other minor data lake tables.
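For intuition, a heavily simplified Airflow sketch of that shape (wait for upstream data, then run a Spark job) might look like the following. This is not the actual WMF DAG; the DAG id, upstream DAG, and artifact names are all illustrative:

```python
# Illustrative sketch only: a monthly DAG that waits for source data and then
# runs a Spark job to produce the base datasets.
from datetime import datetime

from airflow import DAG
from airflow.sensors.external_task import ExternalTaskSensor
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="commons_impact_metrics_example",  # hypothetical id
    schedule="@monthly",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    wait_for_sources = ExternalTaskSensor(
        task_id="wait_for_mediawiki_imports",
        external_dag_id="mediawiki_monthly_import",  # hypothetical upstream DAG
    )
    compute_base_datasets = SparkSubmitOperator(
        task_id="compute_base_datasets",
        application="commons_impact_metrics.jar",  # hypothetical artifact
    )
    wait_for_sources >> compute_base_datasets
```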

Dumps

There is another Airflow DAG that triggers once the base datasets process above has finished, and produces the new monthly release of the Commons Impact Metrics dumps files. It uses a set of SparkSQL queries (the ones starting with dump_) to extract and format the files. There's also a README file at https://dumps.wikimedia.org/other/commons_impact_metrics/readme.html that can be modified in Puppet.

Cassandra loading

There is a third Airflow DAG that also triggers once the base datasets process has finished, and loads the data into 14 different Cassandra tables, each designed to serve one AQS endpoint. The DAG uses a set of SparkSQL queries (the ones starting with load_cassandra_commons) to extract and format the data into the expected shape.

AQS service

Finally, an AQS service named commons-analytics serves the data stored in Cassandra through 14 endpoints. The service uses a generic AQS library named aqsassist, as well as this testing environment.

  1. https://commons.wikimedia.org/wiki/Commons:WMF_support_for_Commons/Commons_Impact_Metrics