Portal:Data Services/Admin/Wiki Replicas
This page holds most of the knowledge we have regarding Wiki Replicas for admins.
Service architecture by layers
From production to Wiki Replicas
Data from production databases is replicated to Wiki Replicas through intermediate "Sanitarium" hosts that sanitize the data removing sensitive information. The data flow is described in detail in the page From production to Wiki Replicas.
Physical database server layer
There are currently 8 servers in this cluster. Each of the nodes hosts two database sections. Since there are 8 sections in the scope of the Wikireplicas service, that means that they are able to function as sets of four that can act as redundant standby for one another. These are all Dell PowerEdge R440s, built to Data Persistence specs.
Web (realtime workloads)
Analytics (long-processing workloads)
DB layer
All database servers run MariaDB.
The multi-instance servers have a somewhat complex layout that is documented in each database server's host hiera file (eg. clouddb1013.yaml). They are multi-instance in the sense that each server has two mariadb database instances running at the same time on nonstandard ports (see: profile::mariadb::section_ports). Each instance is named for and hosts one database section.
Proxy layer
Access to the wiki replicas happens via the cloudlb load balancer. Each section/type pair gets assigned its own service VIP in the cloud-private service VIPs range.
Pooling status of the backend database servers is managed via conftool and has a dedicated runbook available.
DNS
DNS is an important component of helping our users find their server in multi-instance. Each wiki database has a CNAME record in the form of:
${WIKI-DATABASE}.{analytics,web}.db.svc.wikimedia.cloud.
The CNAME points to the appropriate section's DNS, in the form of:
s${SECTION_NUMBER}.{analytics,web}.db.svc.wikimedia.cloud.
The section's DNS is an A record which points to a distinct IP address on the proxy layer (cloudlbXXXX
).
Example:
$ dig +short eswiki.analytics.db.svc.wikimedia.cloud
s7.analytics.db.svc.wikimedia.cloud.
172.20.255.8
Updating the DNS configuration
If you need to update the DNS configuration for Wiki Replicas, e.g. to add a new wikidb alias, refer to the page Wiki Replica DNS.
Account management (maintain-dbusers)
We maintain the list of accounts to access on the cloudcontrols (only one acting as primary at a time). The script writes out a $HOME/replica.my.cnf
file to each Toolforge and PAWS user and Toolforge tool home containing MariaDB connection credentials by using an API that each project NFS server is running. This uses LDAP to get a list of accounts to create.
The credential files are created with the immutable bit set with chattr to prevent deletion by the Tool account.
The code pattern here is that you have a central data store (the db), that is then read/written to by various independent functions. These functions are not 'pure' - they could even be separate scripts. They mutate the DB in some way. They are also supposed to be idempotent - if they have nothing to do, they should not do anything.
Most of these functions should be run in a continuous loop, maintaining mysql accounts for new tool/user accounts as they appear.
Some stats can be found in grafana.
populate_new_accounts
- Find list of tools/users (From LDAP) that aren't in the `accounts` table
- Create a replica.my.cnf for each of these tools/users
- Make an entry in the `accounts` table for each of these tools/users
- Make entries in `account_host` for each of these tools/users, marking them as absent
create_accounts
- Look through `account_host` table for accounts that are marked as absent
- Create those accounts, and mark them as present.
If we need to add a new labsdb, we can do so the following way:
- Add it to the config file
- Insert entries into `account_host` for each tool/user with the new host.
- Run `create_accounts`
In normal usage, just a continuous process running `populate_new_accounts` and `create_accounts` in a loop will suffice.
TODO:
- Support for maintaining per-tool restrictions (number of connections + time)
Who admins what
Wiki Replicas are jointly managed by the following teams: WMCS, Data persistence and Data Engineering.
The following table describes the current split of responsibilities.
Area | Ownership | Notes |
Communication with customers | WMCS | |
Hardware spec definition / review | Data Persistence | |
Hardware provisioning | WMCS | Includes DC Ops interactions |
Hardware maintenance | WMCS | Includes DC Ops interactions |
Hardware decommissioning | WMCS | Includes DC Ops interactions |
Monitoring - new hosts | Data Persistence | Zarcillo, Tendril. When Orchestrator is in place this will be automated. |
Monitoring - DB generic | Data Persistence | |
Monitoring - CloudDB-specific | WMCS | |
OS install verification against spec | WMCS | |
Puppet updates - DB generic | Data Persistence | |
Puppet updates - CloudDB-specific | WMCS | |
Data migration | Data Persistence | Populating new hosts with data |
Replication set up | Data Persistence | |
Alarm triage | WMCS | |
Alarm adjustments | WMCS | |
OS-level troubleshooting | WMCS | |
DB-level troubleshooting | WMCS, escalate to Data Persistence if needed | |
Schema updates | Data Persistence | Through replication |
View creation / updates | Data Engineering | Data Engineering assisting with defining views and what data should be redacted |
Applying view changes | Data Platforms SRE | Apply changes to views to the various wiki replicas |
Performance tuning | WMCS, escalate to Data Persistence if needed | |
Backup / Restore | Data Persistence | Restore from prod via Sanitarium |
Removing PII data | Data Persistence | WMCS cc'd on the alert |
OS updates (to level tested in production) | WMCS, coordinated with Data Persistence | Coordination to ensure we're running compatible versions |
DB version updates (to level tested in production) | Data Persistence, coordinated with WMCS | Coordination to ensure we're running compatible versions |
Admin guide
Docs to perform common tasks related to this service. As detailed as possible.
Adding a wiki
- Documented at Add_a_wiki#Cloud_Services
User connection limits
There are two important aspects to this.
- The review process, which is that requests for expanding the number of connections per tool be discussed at the weekly WMCS planning meeting for approval with a requirement of a +1 from a member of the DBA team after WMCS approval. This should be a restrictive process that errs on the side of denial. Only tools with a pretty good reason and significant user base should use additional connections beyond the default limit of 10.
- Keeping a record of the change in modules/role/templates/mariadb/grants/wiki-replicas.sql. In the file as of this writing, line 38 provides a model for a comment (including the Phabricator task number and user) as well as the exact command that is run on the replicas. For consistency, the command should be run on all four replica hosts.
The process after approval is:
- Determine the mysql username of the tool, which should be in any approved request. This will be something similar to
s52788
and not something liketools.flubber
. It can be found in the tool's$HOME/replica.my.cnf
. - The clinic-duty person logs into the root mysql shell for each replica and runs
GRANT USAGE ON *.* TO '$mysql_username'@'%' WITH MAX_USER_CONNECTIONS $number;
Please note that you can always view grants on the server you are on withSHOW GRANTS FOR '$mysql_username';
- Record the change in modules/role/templates/mariadb/grants/wiki-replicas.sql
- To persist the change on rebuilds, add the new value to hieradata/role/common/wmcs/nfs/primary.yaml
- Communicate with the user that they should be all set!
De-pooling Replica Servers
See Portal:Data Services/Admin/Runbooks/Depool wikireplicas
Updating views
See Portal:Data Services/Admin/Runbooks/Deploy wiki replicas view change
Update a wiki's section
Although it is a rare operation, sometimes production wikis get moved from one section to a different one. For example, if two wikis get moved from s3 to s5, the following steps need to taken on the wikireplicas:
- Update
modules/profile/files/openstack/base/pdns/recursor/labsdb.zone
, follow this example: https://gerrit.wikimedia.org/r/619627 - Run
/usr/local/sbin/maintain-meta_p
on each wikireplica host- Double check that the slices look good and the new wikis are on the correct slice, in this case, s5:
sql meta_p select dbname, slice from wiki where dbname in ('muswiki', 'mhwiktionary');
- From a cloudcontrol node (like cloudcontrol1004.wikimedia.org) run:
wmcs-wikireplica-dns --aliases --shard s3 wmcs-wikireplica-dns --aliases --shard s5
- Double check that the wikis appear on their new section using https://replag.toolforge.org/
History
Before September 2017 we had a cluster composed of three physical servers: labsdb1001, labsdb1002 and labsdb1003. Each host contained a copy of all database sections combined into a single database service which also allowed users to add their own custom tables to each wiki's namespace. DNS names for connecting to these servers used a ${wikidb}.labsdb naming convention. These legacy service names survive as CNAME pointers to the appropriate current (2020) s${SECTION_NUMBER}.analytics.db.svc.wikimedia.cloud name. Records are only maintained for wikis which existed at the time of the migration to the 2017 system.
In 2017 we deployed a new cluster of three physical servers with faster hardware, improved replication, and new service names. Each host still contained a merged view of all database sections under a single MariaDB process. User created tables were not allowed in this redesign to improve the ability to shift request traffic from one instance to another. This redesign also introduced the "web" vs "analytics" split for performance tuning. Read more details in the New Wiki Replica servers ready for use blog post from the launch announcement.
In 2021 we deployed the "2020" cluster of 8 physical servers with a new "multiinstance" design. Multiinstance separates database sections into distinct MariaDB instances with two instances hosted on each physical server. See News/Wiki Replicas 2020 Redesign for more information about the user facing changes this required.
In 2023-2024 we updated the load balancing layer to the cloudlb setup, previously there were lots of more layers and manual steps involved.
See also
- End user docs: Wiki Replicas
- Help:Toolforge/Database
- MariaDB/Sanitarium and clouddb instances
- Wiki Replica redaction
- Add a wiki#Cloud Services
- Portal:Data Services/Admin/Wiki Replica DNS
- phab:T180513 "Document wiki-replicas architecture for future automation"
- News/Wiki Replicas 2020 Redesign
- 2017 cluster user announcement