Jump to content

Commons Impact Metrics/Algorithm

From Wikitech

This page describes the algorithm used by the Commons Impact Metrics data product to traverse the commons category graph. These slides are from a presentation given by Marcel Ruiz Forns at the 2024 Wikimedia Hackathon.

Imagine we have these 3 top-level (primary) categories that we want to generate metrics for: A, B and C.
Let’s call them a1; b1, b2; and c1, c2, c3. To give an example, A could be “Images uploaded by El Prado” and a1, which is a subcategory of A, could be “Paintings in El Prado during 2023”.
And tags them with their ancestors. Note that all subcategories of level 2 (in orange) have 2 ancestors: the immediate one (parent), and the top-level one (primary). And so far so good! The algorithm continues.
It keeps adding new levels of subcategories.
And it reaches a stage where we have a well defined sub-graph for each top-level category. This is what we wanted! However, most of the time, Commons category sub-graphs are way way deeper.
This might seem weird, but is quite common and legitimate. Note that now, subcategories b1 and c2 have inherited the ancestors from their new parent categories! Let’s imagine for the sake of simplicity that there are no more subcategories to visit. Nevertheless, the algorithm needs to propagate those new ancestor chains to the rest of the graph. Because b1 and c2 now have a new set of ancestors, those need to propagate to b1’s and c2’s subcategories respectively.
In this iteration. B3, c4 and c5 are updated with their new ancestors.
And then b5 and c6.
And then b6, b7 and c7. But wait, there’s more! Pay attention to b7. Its ancestors have been updated with all the subcategories from A. And we know b7 is the parent of c2. So the algorithm will propagate b7’s ancestors to c2.
Because c2 is legitimately part of sub-graphs A, B, and C! And we’re still not done.
The algorithm will propagate those new ancestor chains…
…to the end of…
…the graph. Now, if we look at c7, we can see that it has 13 ancestors. And many other subcategories also have a big number of ancestors.
Just because of the subcategory relations that connect the initially independent sub-graphs. And we learned that the Commons category graph, which we thought would be something like…
…this, is actually more like…
…this. Where everything is interconnected. If you start at a given primary category and go deep enough, You will end up connecting to other subgraphs, which in time, are connected with other subgraphs. There’s practically no limit to the depth of the Category graph. This poses a computational problem to our algorithm, which need to traverse huge subgraphs and keep endless lists of ancestor categories.

Because of these computational challenges, we set a couple of boundaries to the calculation.

Instead of calculating metrics for all Commons categories, we do it only for a curated list of categories related to commons mass upload, usually GLAM. Also, we define a maximum “depth” that our algorithm will compute, to avoid going down an infinite rabbit hole of unrelated subcategories. And finally, we aggregate metrics at the coarsest granularity (monthly), to keep the size of the output data small enough for community members to handle without the need of a distributed computation cluster.