Portal:Data Services/Admin/From production to Wiki Replicas

This page contained a detailed explanation of how data flows from the production databases to Wiki Replicas and how data redaction is performed.

This page is mostly targeted at Wiki Replicas admins. For a shorter high-level overview targeted at end users, read Wiki Replica redaction.

(The editable file for this diagram is here)

Step 0: databases in production

The starting situation is that there are databases in production for wiki projects (like wikipedia, wikidata, wikictionary, and friends). We would like to provide this same databases for WMCS users. Due to privacy reasons, some data needs to be redacted or deleted. That's why users can't directly access this database.

So, we choose what databases to copy to wiki-replicas, which are all of them.

Every time a new database is created in production (for example, a new language for a wiki) we are in this step 0.

The identification of new database candidates for migrating to wiki-replicas is done under request by someone. Right now there aren't any mechanisms to notify pending migrations or the like.

Step 1: sanitization

Sanitariums are special database hosts managed by the Data Persistence team that contain a copy of production databases but without any private information (e.g. emails, passwords, etc.)

There are 2 Sanitarium hosts in Eqiad and 2 in Codfw. Each Sanitarium host runs 4 MariaDB services on different ports (one service per db section). For example, db1154 contains the sections s1, s3, s5 and s8.

In Sanitariums, MariaDB is set to replicate from a production database using filters to avoid replicating private databases and tables, and triggers to redact the values of private columns during INSERTs and UPDATEs:

  • Databases that should not be replicated (private wikis) are added to replicate-wild-ignore-table using the $private_wikis puppet variable. (Note: this is separate from private.dblist)
  • Tables that should not be replicated are added to replicate-wild-ignore-table using the $private_tables puppet variable.
  • Columns that contain private information are redacted via triggers that are set based on the list of columns at filtered_tables.txt, using the redact_sanitarium script.

Having this redaction done on a separate host outside of Cloud Services helps isolate the security of the data and ensure a privilege escalation via the Cloud Services access does not compromise the most sensitive data in the production databases.

Please note that some additional filtering of private information is done in the downstream Wiki Replica hosts (see Step 6 below).

Step 2: evaluation

Once data is in sanitarium boxes, some cron jobs and manual scripts check whether data is actually redacted. For example, check that a given column is NULL.

Involved code:

This evaluation also happens in the wiki-replica servers, and alerts in case some private data is detected.

The main production team, DBAs, are in charge of this step.

Step 3: filling up wiki-replicas

Data is finally copied to wiki-replica servers. The DB servers are currently using row-based replication from sanitization boxes to wiki-replicas.

This is done in real time by the DB daemons, so there is always a data flow, which should be sanitized due to previous steps.

Step 4: setting up GRANTs

At this point, database GRANTS are created by main operations team DBAs.

This is done by hand, no automation using puppet, but there is a file where GRANTS are being tracked: modules/role/templates/mariadb/grants/wiki-replicas.sql.

The content of the file is something like the following:

Related tickets:

Step 5: setting indexes

Create indexes that only exist on the wiki replicas via the maintain-replica-indexes script. This script manages the indexes in an idempotent fashion (if you run it multiple times it only changes what doesn't match the configuration file here index-conf.yaml). This could use the ability to handle only a single database, but it skips tables it finds comply with the definitions in the config, so it is fairly quick to run, despite that it runs across the whole set.

On each wiki-replica server, run:

% sudo /usr/local/sbin/maintain-replica-indexes

Step 6: setting views

Create _p (public) views, which are intermediate views which leave out private data.

This is done by means of the maintain-views.py script.

In each wiki-replica server, it's executed like this:

% sudo maintain-views --databases $wiki --debug

Running the maintain-views script/cookbook is handled by the WMCS team, with Data Engineering providing input on what should and should not be exposed.

TODO: elaborate info on what is this doing

Step 7: setting up metadata

Insert a new row in meta_p.wiki for the new wiki by running the maintain-meta_p script. This is only relevant for S7 databases.

The execution is like this:

% sudo /usr/local/sbin/maintain-meta_p --databases $wiki. 

If the meta_p doesn't exist, you need to run it with --boostrap option.

For example, in case of a full bootstrap:

user@cumin1001:~$ sudo cumin "P{R:Profile::Mariadb::Section = 's7'} and P{P:wmcs::db::wikireplicas::mariadb_multiinstance}" "/usr/local/sbin/maintain-meta_p --all-databases --bootstrap"
3 hosts will be targeted:
clouddb[1014,1018,1021].eqiad.wmnet
[..]

This step is handled by the WMCS team.

TODO: elaborate info on what is this doing

Step 8: setting up DNS

From an openstack control server (for example cloudcontrol1004.wikimedia.org), run:

$ sudo /usr/local/sbin/wikireplica_dns --aliases --shard <sN>


To update the service IP used for the analytics or web servers:

  • Edit modules/openstack/files/util/wikireplica_dns.yaml to change the service IP
  • Run /usr/local/sbin/wikireplica_dns
root@cloudcontrol1007:~# /usr/local/sbin/wikireplica_dns -v --zone web.db.svc.eqiad.wmflabs.
2018-03-16T14:02:41Z mwopenstackclients.DnsManager INFO    : Updating s8.web.db.svc.eqiad.wmflabs.
2018-03-16T14:02:42Z mwopenstackclients.DnsManager INFO    : Updating s3.web.db.svc.eqiad.wmflabs.
2018-03-16T14:07:17Z mwopenstackclients.DnsManager INFO    : Updating s2.web.db.svc.eqiad.wmflabs.
2018-03-16T14:07:23Z mwopenstackclients.DnsManager INFO    : Updating s1.web.db.svc.eqiad.wmflabs.
2018-03-16T14:07:24Z mwopenstackclients.DnsManager INFO    : Updating s7.web.db.svc.eqiad.wmflabs.
2018-03-16T14:07:28Z mwopenstackclients.DnsManager INFO    : Updating s6.web.db.svc.eqiad.wmflabs.
2018-03-16T14:07:29Z mwopenstackclients.DnsManager INFO    : Updating s5.web.db.svc.eqiad.wmflabs.
2018-03-16T14:07:30Z mwopenstackclients.DnsManager INFO    : Updating s4.web.db.svc.eqiad.wmflabs.

This step is done by the WMCS team.

TODO: elaborate info on what is this doing

Step 9: all is done

All is done, wiki-replica contains a mirror of the production database. Finally WCMS users/tools/projects are able to query the database/tables.

This is usually done by using the sql wrapper script.

TODO: the benefits of using the sql script.