Wiki Replica redaction

From Wikitech

This page documents how data is sanitized for the public Wiki Replicas databases that Wikimedia Cloud Services provides.

This page provides a high-level overview for Wiki Replicas users. The detailed admin docs are in Portal:Data Services/Admin/From_production_to_Wiki_Replicas.

Sanitization happens in two steps: first some private data is removed by Sanitarium hosts, then more private data is removed through Wiki Replicas views.

Sanitarium hosts

Replication into the Sanitarium hosts uses triggers and filters to remove sensitive columns, tables and databases in the simple case where the redaction is unconditional (e.g. ensuring that user_password never ends up in Wiki Replicas).
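As a rough illustration of trigger-based unconditional redaction, the sketch below blanks a sensitive column whenever a row is written. It uses Python's built-in sqlite3 module as a stand-in database. The simplified user table and the trigger names are assumptions made for the example and are not taken from the actual Sanitarium configuration.

```python
import sqlite3

# In-memory SQLite database standing in for a sanitizing replica (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, heavily simplified version of a user table.
CREATE TABLE user (
    user_id INTEGER PRIMARY KEY,
    user_name TEXT NOT NULL,
    user_password TEXT NOT NULL DEFAULT ''
);

-- Hypothetical trigger: blank the sensitive column on every insert.
CREATE TRIGGER user_redact_insert AFTER INSERT ON user
BEGIN
    UPDATE user SET user_password = '' WHERE user_id = NEW.user_id;
END;

-- Hypothetical trigger: blank it again if an update writes a value.
-- The WHEN clause stops the trigger from acting on its own blanking update.
CREATE TRIGGER user_redact_update AFTER UPDATE OF user_password ON user
WHEN NEW.user_password <> ''
BEGIN
    UPDATE user SET user_password = '' WHERE user_id = NEW.user_id;
END;
""")

# A row arrives with a password hash, as it would from the upstream database.
conn.execute(
    "INSERT INTO user (user_name, user_password) VALUES (?, ?)",
    ("Example", "hash-value"),
)
after_insert = conn.execute(
    "SELECT user_name, user_password FROM user"
).fetchone()

# A later update tries to write a new hash; the trigger blanks it again.
conn.execute(
    "UPDATE user SET user_password = 'new-hash' WHERE user_name = 'Example'"
)
after_update = conn.execute("SELECT user_password FROM user").fetchone()[0]

print(after_insert, repr(after_update))
```

In this sketch the redaction has no condition attached: every row loses the column value regardless of its other fields, which is the kind of case the text assigns to the Sanitarium step.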

A check_private_data_report script also verifies that redaction happened properly. It runs weekly via cron and emails the results to the DBAs when a mismatch is found.
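The general idea of such a check can be sketched as follows. This is not the check_private_data_report script itself, whose internals are not described here. The function name find_unredacted, the PRIVATE_COLUMNS list and the user_email column are hypothetical, and SQLite stands in for the real database.

```python
import sqlite3

# Hypothetical list of (table, column) pairs that must hold no data.
# The names come from a trusted, hard-coded list, not from user input.
PRIVATE_COLUMNS = [("user", "user_password"), ("user", "user_email")]


def find_unredacted(conn, private_columns):
    """Return (table, column, row_count) for private columns that still hold data."""
    problems = []
    for table, column in private_columns:
        # Column names present in the table; empty if the table is missing.
        existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        if column not in existing:
            continue  # the column (or table) was filtered out entirely
        count = conn.execute(
            f"SELECT COUNT(*) FROM {table} "
            f"WHERE {column} IS NOT NULL AND {column} <> ''"
        ).fetchone()[0]
        if count:
            problems.append((table, column, count))
    return problems


# Demo database: user_password is blanked, user_email has leaked through.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (
    user_id INTEGER PRIMARY KEY,
    user_name TEXT,
    user_password TEXT,
    user_email TEXT
);
INSERT INTO user VALUES (1, 'Example', '', 'someone@example.org');
""")

report = find_unredacted(conn, PRIVATE_COLUMNS)
if report:
    # A real job would notify someone here; this sketch only prints.
    print("Redaction mismatch:", report)
```

A periodic job built along these lines would only need to send a notification when the returned list is non-empty, which matches the "email only on mismatch" behaviour described above.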

Wiki Replicas views

Wiki Replicas hosts contain both tables and views. Wiki Replicas users only have permission to access the views. The file maintain-views.yaml contains the view definitions that determine what is public and available to end users. These definitions include conditional redactions that cannot be done by the Sanitarium triggers and filters (e.g. revision delete), and they also serve as defense in depth in case one of the Sanitarium redactions fails.
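A conditional redaction in a view can be sketched as below, again with sqlite3 as a stand-in. The table name revision_base, the simplified column set, and the use of bit value 2 in rev_deleted to mean "comment hidden" are assumptions made for the example. The real definitions live in maintain-views.yaml and may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical underlying table that end users cannot query directly.
CREATE TABLE revision_base (
    rev_id INTEGER PRIMARY KEY,
    rev_deleted INTEGER NOT NULL DEFAULT 0,  -- bitfield of hidden parts
    rev_comment TEXT
);

-- Public view: the comment is exposed only when its "hidden" bit is unset.
-- Bit value 2 is an assumption chosen for this sketch.
CREATE VIEW revision AS
SELECT rev_id,
       rev_deleted,
       CASE WHEN rev_deleted & 2 THEN NULL ELSE rev_comment END AS rev_comment
FROM revision_base;

INSERT INTO revision_base VALUES
    (1, 0, 'fix typo'),
    (2, 2, 'hidden summary');
""")

# Users query the view, never the base table.
rows = conn.execute(
    "SELECT rev_id, rev_comment FROM revision ORDER BY rev_id"
).fetchall()
print(rows)
```

Unlike an unconditional trigger, the view decides per row: the first revision keeps its comment, while the second has it replaced with NULL because its rev_deleted value has the relevant bit set.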

Document redaction decisions

TODO: include documentation and rationale for any information exposed publicly here that is not publicly exposed by MediaWiki (MW).