Jump to content

Commons Impact Metrics/Data Model

From Wikitech

This page describes the data model of the Commons Impact Metrics data product. It includes detailed descriptions of all dimensions and metrics as well as the detailed schemas of the 5 base datasets stored in Hive (Iceberg) and exported as dumps. It does not include information about the Commons Impact Metrics API (AQS service) or the corresponding Cassandra tables, although some of the definitions are shared. It also does not include contextual information on the project, caveats, nor how to access the data. For find more general information about the project, see Commons Impact Metrics.

Field glossary

The following table contains detailed definitions for all fields of the Commons Impact Metrics data model. They are valid for all datasets and datastores down the data pipeline: Hive (Iceberg), Cassandra, AQS and dumps.

Field name Description
category The name (or title) of a category, as it appears in the URL of the corresponding category page in Commons. For example: Images_from_Memòria_Digital_de_Catalunya is the category name of https://commons.wikimedia.org/wiki/Category:Images_from_Memòria_Digital_de_Catalunya.

This normalized version of the category name (with underscores instead of spaces) is the standard that is used across all Commons Impact Metrics datasets and datastores.

category_scope It is either "shallow" or "deep". It always goes together with a category field and a metric count field.
shallow Indicates that the associated metric count pertains to the associated category by itself, and not to the category tree.
deep Indicates that the associated metric count pertains to the associated full category tree (the category itself and all its descendants).
parent_categories

or categories

It goes together with a category or a media file field. It is a list of the categories that are a direct parent to the associated field. They use the same format as the category field described above. Note: When the associated field is a category, this field is called parent_categories. When the associated field is a media file, this field is called categories.
primary_categories It goes together with a category or a media_file field. It is a list of the categories that are a top ancestor to the associated field. They use the same format as the category field described above. Note: Top ancestors are the ones listed in the Commons Impact Metrics allow-list by definition.
media_file The name (or title) of a media file, as it appears in the URL of the corresponding media file page in Commons. For example: A_la_font_de_Torrentsenta_a_Gósol_(cropped).jpg is the media file name of https://commons.wikimedia.org/wiki/File:A_la_font_de_Torrentsenta_a_Gósol_(cropped).jpg.

This normalized version of the media file name (with underscores instead of spaces) is the standard that is used across all Commons Impact Metrics datasets and datastores.

media_type Indicates the type of media file. The values are directly imported from the img_media_type field in MediaWiki's image table, i.e. "BITMAP", "VIDEO", etc.
user_name A user name, as it appears in the actor_name field of MediaWiki's actor table. It goes together with an edit event. If no actor is found for the edit, user_name is ":anonymous:". If the edit's actor has been suppressed, user_name is ":redacted:". The colon symbols are added to those special values to prevent collisions with existing user names.
edit_type Indicates the type of edit to a media file. Either “create”, for the first revision of a media file page; or “update”, for all subsequent revisions of the media file page.
wiki The canonical name of a wiki. It indicates the language and the family of the wiki in the format "<language>.<family>". I.e. “en.wikipedia” or “ja.wiktionary”.
page_title The title of a wiki page, as it appears in the URL of the corresponding page in some wiki. For example: दिल्ली is the page title of https://hi.wikipedia.org/wiki/दिल्ली. This normalized version of the page title (with underscores instead of spaces) is the standard that is used across all Commons Impact Metrics datasets and datastores. Note: This data product only reports on namespace=0 wiki pages.
year_month For the aggregated datasets, Pageviews per Category and Pageviews per Media File, year_month indicates the month for which the data is aggregated. For the snapshot based datasets, Category Metrics Snapshot and Media File Metrics Snapshot, year_month indicates the month at the end of which the snapshot was taken. In all cases the format is "yyyy-MM", i.e. "2024-01".
dt The timestamp of an edit. It has different formats depending on the datastore. In Hive (Iceberg) it is of type TIMESTAMP (granularity in seconds). In the dumps is has the format "yyyy-MM-dd'T'HH:mm:ss". It is not reported via AQS or dumps.
pageview_count This metric is always associated with a given Commons media file, or a given Commons category. When it's associated with a media file, it counts the number of pageviews to wiki pages containing that media file. When it's associated with a category, it counts the number of pageviews to wiki pages containing any media file belonging to that category. More specifically, any media file directly associated with the category, if category_scope="shallow"; and any media file within the whole associated category tree, if category_scope="deep".

The pageview_count metric can be broken down by wiki, by page_title and by year_month. It is also an additive metric, so it can be aggregated across those same 3 dimensions. For instance, if the pageview_count metric value for January is P1 and for February is P2, then the combined value since start of January until end of February is P1+P2.

The pageview_count metric has a known caveat in the context of this project: monthly drift. Read more about monthly drift.

media_file_count This metric is always associated with a given Commons category. It counts the number of media files directly belonging to that category.
media_file_count_deep This metric is always associated with a given Commons category. It counts the number of media files belonging to that category tree. Meaning the media files can belong to the category itself, or to any of its descendant categories (child, grandchild, etc.).
used_media_file_count This metric is always associated with a given Commons category. It counts the number of media files directly belonging to that category, which appear in at least 1 wiki page.
used_media_file_count_deep This metric is always associated with a given Commons category. It counts the number of media files belonging to that category tree, which appear in at least 1 wiki page. Meaning the used media files can belong to the category itself, or to any of its descendant categories (child, grandchild, etc.).
leveraging_wiki_count This metric is always associated with a given Commons media file, or a given Commons category. When associated with a media file, it counts the number of wikis having at least 1 page that features that media file. When associated with a category, it counts the number of wikis having at least 1 page that features any media file directly belonging to the associated category.
leveraging_wiki_count_deep This metric is always associated with a given Commons category. It counts the number of wikis having at least 1 page that features any media file belonging to the associated category tree (the category itself, its children, grandchildren, etc.).
leveraging_page_count This metric is always associated with a given Commons media file, or a given Commons category. When associated with a media file, it counts the number of wiki pages (namespace=0) featuring that media file. When associated with a category, it counts the number of wiki pages (namespace=0) featuring any media file directly belonging to the associated category.
leveraging_page_count_deep This metric is always associated with a given Commons category. It counts the number of wiki pages (namespace=0) featuring any media file belonging to the associated category tree (the category itself, its children, its grandchildren, etc.).
edit_count This metric is always associated with a given wiki user name, or a given Commons category. When associated with a user name, it counts the number of edits performed by the associated user, to media files belonging to the allow-listed Commons categories. When associated with a category, it counts the number of edits to media files belonging to the associated category. More specifically, edits to media files directly associated with the category, if category_scope="shallow"; and edits to media files within the whole associated category tree, if category_scope="deep".

Base datasets

The Commons Impact Metrics data product consists of 5 base datasets stored in Hive (Iceberg), and also exported in the form of dumps. Two of them (Category metrics snapshot and Media file metrics snapshot) are snapshot-based. Two of them (Pageviews per category monthly and Pageviews per media file monthly) are aggregated cubes. The last one (Edits) is event-based.

Snapshot-based datasets

Snapshot-based datasets capture the state of things at a given point in time. In Commons Impact Metrics, the snapshot is taken at the end of the month in question. The metrics included in snapshot-based datasets are absolute counts since the beginning of time, up to the instant of the snapshot. For instance, the number of media files contained inside a Commons category. Note that such metrics can not be aggregated over time.

Category metrics snapshot

This dataset stores metrics about Commons categories. Each row corresponds to a category (for a given snapshot). It can be an allow-listed (primary) category or one of its subcategories. Only allow-listed (primary) categories report on "deep" metrics, all categories report on "shallow" metrics. The metric values (int) are not aggregatable. All queries to this table should always filter or breakdown by category and year_month.

Schema

Field Type Description
category string The name of the category this row refers to.
parent_categories list<string> The immediate ancestor category names of this row's category.
primary_categories list<string> The top ancestor category names of this row’s category.
media_file_count int The number of media files contained in this category.
media_file_count_deep int The number of media files contained in this category tree. Only available for primary allow-listed categories.
used_media_file_count int The number of media files from this category featured in at least one wiki page.
used_media_file_count_deep int The number of media files from this category tree featured in at least one wiki page. Only available for primary allow-listed categories.
leveraging_wiki_count int The number of wikis featuring at least one of this category’s media files.
leveraging_wiki_count_deep int The number of wikis featuring at least one of this category tree’s media files. Only available for primary allow-listed categories.
leveraging_page_count int The number of pages featuring at least one of this category’s media files.
leveraging_page_count_deep int The number of pages featuring at least one of this category tree’s media files. Only available for primary allow-listed categories.
year_month string The month after which the snapshot was taken (YYYY-MM).

Sample queries

-- Get shallow metrics about a given category.
SELECT
    media_file_count,
    used_media_file_count,
    leveraging_wiki_count,
    leveraging_page_count
FROM category_metrics_snapshot
WHERE
    year_month = "2024-01" AND
    category = "My_category"
;

-- Get a time series that shows the usage of a category tree over time.
-- Note we are not filtering by year_month, but we are breaking down by it.
SELECT
    year_month,
    used_media_file_count_deep
FROM category_metrics_snapshot
WHERE catgegory = "My_allow_listed_category"
GROUP BY year_month
ORDER BY year_month ASC
;

-- Get the category graph for an allow-listed category tree,
-- represented as edges of the form (parent_category, child_category).
SELECT
    EXPLODE(parent_categories) AS parent_category,
    category AS child_category
FROM category_metrics_snapshot
WHERE
    year_month = "2024-01" AND
    ARRAY_CONTAINS(primary_categories, "My_allow_listed_category")
;

Media file metrics snapshot

This dataset stores metrics about Commons media files. Each row corresponds to a media file (for a given snapshot). To reduce the size of the data, media files that are not featured in any wiki page do not appear in this dataset. The metric values (int) are not aggregatable. All queries to this table should always filter or breakdown by media_file and year_month.

Schema

Field Type Description
media_file string The name of the media file this row refers to.
media_type string The media type of the media file.
categories list<string> The immediate ancestor category names of this row's media file.
primary_categories list<string> The top ancestor category names of this row’s media file.
leveraging_wiki_count long The number of wikis featuring this row's media file.
leveraging_page_count long The number of wiki pages (across all wikis) featuring this row's media file.
year_month string The month after which the snapshot was taken (YYYY-MM).

Sample queries

-- Get metrics about a given media file.
SELECT
    leveraging_wiki_count,
    leveraging_page_count
FROM media_file_metrics_snapshot
WHERE
    year_month = "2024-01" AND
    media_file = "My_media_file.jpg"
;

-- Get a time series that shows the leverage of a media file over time.
-- Note we are not filtering by year_month, but we are breaking down by it.
SELECT
    year_month,
    leveraging_page_count
FROM media_file_metrics_snapshot
WHERE media_file = "My_media_file.jpg"
GROUP BY year_month
ORDER BY year_month ASC
;

Aggregated cubes

Aggregated cube datasets store metric aggregations broken down by a set of dimensions. Ideally, the metrics are additive, and you can slice and dice the dataset aggregating across all dimensions including time. In Commons Impact Metrics, the aggregation is done at a monthly granularity, for data size reasons. And the only metric is pageview_count (see its definition in the metrics table above).

Pageviews per category monthly

This dataset stores pageview counts for Commons categories, as explained in the metric definitions section above. The pageview_counts are dimensioned by wiki, page_title and year_month. You can aggregate pageview_counts only across those 3 dimensions. You can not aggregate across the category dimension. Pageviews to a wiki page can be attributed to more than one category, when that page contains media files from several categories. So, aggregating across the category dimension will produce duplicate counts. You can't either aggregate across the category_scope dimension, since the pageview_count metric has different meanings for category_scope="shallow" vs. category_scope="deep" (read more in the metric definitions section above). So, you should always filter or breakdown by both category and category_scope. Only allow-listed (primary) categories report pageviews for category_scope="deep". All categories report pageviews for category_scope="shallow". Pageviews to a wiki's Main page are not counted. Rows with pageview_count=0 are omitted.

Schema

Field Type Description
category string The name of the category this row refers to.
category_scope string The scope of the category; either "shallow" or "deep".
primary_categories list<string> The top ancestor category names of this row’s media file.
wiki string The canonical name of the visualized wiki. Only wikis that feature at least one media file of the corresponding category will appear here.
page_title string The title of the visualized wiki page. Only pages featuring at least one media file of the corresponding category will appear here.
pageview_count long Aggregated pageview count.
year_month string The month for which we aggregate the data (YYYY-MM).

Sample queries

-- Get all-time aggregated pageview counts for a given category tree.
SELECT SUM(pageview_count) AS pageview_count
FROM pageviews_per_category_monthly
WHERE
    category = "My_primary_category" AND
    category_scope = "deep"
;

-- Get a time series that shows the pageviews of a category over time.
SELECT
    year_month,
    pageview_count
FROM pageviews_per_category_monthly
WHERE
    category = "My_category" AND
    category_scope = "shallow"
GROUP BY year_month
ORDER BY year_month ASC
;

-- Get a 2023 rank of the wiki pages with most pageviews
-- containing media files from a given category tree.
SELECT
    wiki,
    page_title,
    SUM(pageview_count) AS pageview_count
FROM pageviews_per_category_monthly
WHERE
    year_month BETWEEN "2023-01" AND "2023-12" AND
    category = "My_primary_category" AND
    category_scope = "deep"
GROUP BY
    wiki,
    page_title
ORDER BY pageview_count DESC
LIMIT 100
;

Pageviews per media file monthly

This dataset stores pageview counts for Commons media files, as explained in the metric definitions section above. The pageview_counts are dimensioned by wiki, page_title and year_month. You can aggregate pageview_counts only across those 3 dimensions. You can not aggregate across the media_file dimension. Pageviews to a wiki page can be attributed to more than one media file, when that page contains several media files. So, aggregating across the media_file dimension will produce duplicate counts. So, you should always filter or breakdown by media_file. Pageviews to a wiki's Main page are not counted. Rows with pageview_count=0 are omitted.

Schema

Field Type Description
media_file string The name of the media file this row refers to.
categories list<string> The parent category names of this row's media file.
primary_categories list<string> The top ancestor category names of this row’s media file.
wiki string The canonical name of the visualized wiki. Only wikis that feature at least one media file of the corresponding category will appear here.
page_title string The title of the visualized wiki page. Only pages featuring at least one media file of the corresponding category will appear here.
pageview_count long Aggregated pageview count.
year_month string The month for which we aggregate the data (YYYY-MM).

Sample queries

-- Get all-time aggregated pageview counts for a given media file.
SELECT SUM(pageview_count) AS pageview_count
FROM pageviews_per_media_file_monthly
WHERE media_file = "My_media_file.jpg"
;

-- Get a time series that shows the pageviews of a media file over time.
SELECT
    year_month,
    pageview_count
FROM pageviews_per_media_file_monthly
WHERE media_file = "My_media_file.jpg"
GROUP BY year_month
ORDER BY year_month ASC
;

-- Get a 2023 rank of the wiki pages with most pageviews
-- containing a given media file.
SELECT
    wiki,
    page_title,
    SUM(pageview_count) AS pageview_count
FROM pageviews_per_media_file_monthly
WHERE
    year_month BETWEEN "2023-01" AND "2023-12" AND
    media_file = "My_media_file"
GROUP BY
    wiki,
    page_title
ORDER BY pageview_count DESC
LIMIT 100
;

Event-based datasets

In event-based datasets each row represents an event, something that happened in the context of the dataset. Event-based datasets always have a field containing the timestamp of the event, with very fine granularity, i.e. milliseconds. Usually, they don't have any metric value.

Edits

This is an event-based dataset. Each row corresponds to an edit event performed on a Commons media file belonging to an allow-listed category tree. The dataset does not contain any metric per se, but you can aggregate row counts across any set of dimensions.

Schema

Field Type Description
user_name string The user name of the user who performed the edit.
edit_type string Either “create” or “update”.
media_file string The name of the edited media file.
categories list<string> The (parent) category names that the media file is directly associated with.
primary_categories list<string> The top ancestor category names of this row’s media file.
dt timestamp The timestamp of the edit.

Sample queries

-- Get all-time aggregated edit counts for a given allow-listed category tree.
SELECT SUM(edit_count) AS edit_count
FROM edits
WHERE ARRAY_CONTAINS(primary_categories, "My_primary_category")
;

-- Get a time series that shows the edits to a media file over time.
SELECT
    month(dt) AS year_month,
    edit_count
FROM edits
WHERE media_file = "My_media_file.jpg"
GROUP BY month(dt)
ORDER BY year_month ASC
;

-- Get a 2023 rank of the user names with most edits
-- to media files belonging to a given category.
SELECT
    user_name,
    SUM(edit_count) AS edit_count
FROM edits
WHERE
    dt BETWEEN TO_TIMESTAMP("2023-01") AND TO_TIMESTAMP("2023-12-31 23:59:59.999") AND
    ARRAY_CONTAINS(categories, "My_category")
GROUP BY user_name
ORDER BY edit_count DESC
LIMIT 100
;