User:Milimetric/MediaWiki db schema/Thoughts
Task T370948 anticipated that, in the process of reconciling data from MediaWiki databases against events published by EventBus, we might come up with ideas for how the MediaWiki database schema could change to make these types of data pipelines easier to implement. The following summarizes our thoughts along these lines.
First, the problem is that we are simultaneously heading in multiple directions. We are currently considering:
- re-thinking MediaWiki hooks. This might make EventBus's work of publishing data from MediaWiki more straightforward and, in turn, remove the need for some steps in the Dumps 2.0 pipeline. If, additionally, delivery of these events were guaranteed or easy to reconcile, the need to interact with the MediaWiki schema would vanish.
- Change Data Capture (CDC: listening to the binlog that the MediaWiki databases already publish for replication, with Kafka as a sink). This would not solve everything (e.g. it would still require us to re-implement MW database interactions far away from the comforts of its PHP API), but it would eliminate the need for us to make any schema changes, since reconciliation would happen automatically as part of CDC.
- No change: we keep copying large amounts of data from MediaWiki in batches and we don't change how we publish events via EventBus.
Changing MediaWiki's schema is a big deal. If we decide on one of the first two paths, a change would only be temporarily useful. For this reason, the most important point we make here is that we probably shouldn't change the MW schema to make reconciliation easier. Still, as a thought exercise, here's what we might do and why it would help the Dumps 2 reconciliation jobs.
rev_touched
One of the tricky things about the revision table is finding out which rows changed in a given time period. An indexed timestamp column, analogous to page_touched, would make these queries trivial and allow downstream consumers to pull far less data to accomplish what they need. To work around the lack of such a column, Dumps 2 currently looks at new records in the revision table together with different types of logging actions, like delete/revision, to approximate something like rev_touched. Other pipelines, like the MediaWiki History dataset, simply copy the whole revision table and therefore run only once a month to limit impact.
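As a rough illustration, this is the shape the column and the resulting query could take. Only the rev_touched name comes from the proposal above; the binary(14) type, the index, and the example window are assumptions that mirror existing MediaWiki timestamp columns such as rev_timestamp.

```sql
-- Hypothetical sketch: rev_touched would record the last time anything about
-- the revision changed (visibility, page membership, etc.), not just when it
-- was created. Type and index are assumed; backfilling existing rows is
-- glossed over here.
ALTER TABLE revision
  ADD COLUMN rev_touched BINARY(14) NOT NULL,
  ADD INDEX rev_touched (rev_touched);

-- A reconciliation job could then fetch only what changed in a window,
-- instead of approximating via new revision rows plus logging actions.
SELECT rev_id, rev_page, rev_timestamp, rev_touched
FROM revision
WHERE rev_touched >= '20240101000000'
  AND rev_touched <  '20240102000000';
```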
rev_archived
The unification of delete systems may make this redundant; we include it here for completeness. The way revisions move between the archive and revision tables complicates certain queries. In addition to the rev_touched column proposed above, it would be useful to flag whether or not a revision belongs to a deleted page via a simple rev_archived flag. The page table would then need something like page_deleted. With this, instead of having to query two tables, all revisions changed during a given time period could be retrieved in one query, and this would hold for any kind of change: merge, delete, move, revision delete, and create.
Conclusion
For Dumps 2, these changes would simplify all the queries we run on the DB replicas. For other pipelines, such as MW History, this, together with backfilling some normalized data, would let us remove a lot of complicated code. For example, we currently infer page ids from page titles when the ids are missing from the logging table; more recent rules on the logging table require page ids where they're needed, so checking and backfilling the inferred ids (see the sketch below) would help both MW and downstream pipelines.
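For the page-id inference just mentioned, a very rough sketch of what a backfill could look like. It glosses over the real complexity (titles get reused across moves, deletes, and restores, so MediaWiki History has to resolve titles against page state at each log entry's timestamp); it is only meant to show the shape of the work that a backfilled log_page would make unnecessary.

```sql
-- Hypothetical, simplified backfill: fill in missing log_page values by
-- matching the logged title to the page table. Real inference must account
-- for pages moving and being deleted/restored over time, which this ignores.
UPDATE logging l
JOIN page p
  ON p.page_namespace = l.log_namespace
 AND p.page_title     = l.log_title
SET l.log_page = p.page_id
WHERE l.log_page IS NULL OR l.log_page = 0;
```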
Some of these changes have been proposed elsewhere and may be good ideas for MW itself. We're happy to discuss this with anyone interested, given the caveat above that we don't think these changes should happen just to support better reconciliation.