Jump to content

Portal:Data Services/Admin/Wiki Replicas

From Wikitech

This page holds most of the knowledge we have regarding Wiki Replicas for admins.

Service architecture by layers

From production to Wiki Replicas

Data from production databases is replicated to Wiki Replicas through intermediate "Sanitarium" hosts that sanitize the data removing sensitive information. The data flow is described in detail in the page From production to Wiki Replicas.

Physical database server layer

There are currently 8 servers in this cluster. Each of the nodes hosts two database sections. Since there are 8 sections in the scope of the Wikireplicas service, that means that they are able to function as sets of four that can act as redundant standby for one another. These are all Dell PowerEdge R440s, built to Data Persistence specs.

Web (realtime workloads)

Analytics (long-processing workloads)

Despite using the same clouddb* prefix, clouddb2002-dev.codfw.wmnet is not a Wiki Replica host. It is used by labtestwikitech and it should be eventually decommissioned, tracked in phab:T369308

DB layer

All database servers run MariaDB.

The multi-instance servers have a somewhat complex layout that is documented in each database server's host hiera file (eg. clouddb1013.yaml). They are multi-instance in the sense that each server has two mariadb database instances running at the same time on nonstandard ports (see: profile::mariadb::section_ports). Each instance is named for and hosts one database section.

Proxy layer

Access to the wiki replicas happens via the cloudlb load balancer. Each section/type pair gets assigned its own service VIP in the cloud-private service VIPs range.

Pooling status of the backend database servers is managed via conftool and has a dedicated runbook available.

DNS

DNS is an important component of helping our users find their server in multi-instance. Each wiki database has a CNAME record in the form of:

${WIKI-DATABASE}.{analytics,web}.db.svc.wikimedia.cloud.

The CNAME points to the appropriate section's DNS, in the form of:

s${SECTION_NUMBER}.{analytics,web}.db.svc.wikimedia.cloud.

The section's DNS is an A record which points to a distinct IP address on the proxy layer (cloudlbXXXX).

Example:

$ dig +short eswiki.analytics.db.svc.wikimedia.cloud
s7.analytics.db.svc.wikimedia.cloud.
172.20.255.8

Updating the DNS configuration

If you need to update the DNS configuration for Wiki Replicas, e.g. to add a new wikidb alias, refer to the page Wiki Replica DNS.

Account management (maintain-dbusers)

We maintain the list of accounts to access on the cloudcontrols (only one acting as primary at a time). The script writes out a $HOME/replica.my.cnf file to each Toolforge and PAWS user and Toolforge tool home containing MariaDB connection credentials by using an API that each project NFS server is running. This uses LDAP to get a list of accounts to create.

The credential files are created with the immutable bit set with chattr to prevent deletion by the Tool account.

The code pattern here is that you have a central data store (the db), that is then read/written to by various independent functions. These functions are not 'pure' - they could even be separate scripts. They mutate the DB in some way. They are also supposed to be idempotent - if they have nothing to do, they should not do anything.

Most of these functions should be run in a continuous loop, maintaining mysql accounts for new tool/user accounts as they appear.

Some stats can be found in grafana.

populate_new_accounts

  • Find list of tools/users (From LDAP) that aren't in the `accounts` table
  • Create a replica.my.cnf for each of these tools/users
  • Make an entry in the `accounts` table for each of these tools/users
  • Make entries in `account_host` for each of these tools/users, marking them as absent

create_accounts

  • Look through `account_host` table for accounts that are marked as absent
  • Create those accounts, and mark them as present.

If we need to add a new labsdb, we can do so the following way:

  • Add it to the config file
  • Insert entries into `account_host` for each tool/user with the new host.
  • Run `create_accounts`

In normal usage, just a continuous process running `populate_new_accounts` and `create_accounts` in a loop will suffice.

TODO:

 - Support for maintaining per-tool restrictions (number of connections + time)

Who admins what

Wiki Replicas are jointly managed by the following teams: WMCS, Data persistence and Data Engineering.

The following table describes the current split of responsibilities.

Area Ownership Notes
Communication with customers WMCS
Hardware spec definition / review Data Persistence
Hardware provisioning WMCS Includes DC Ops interactions
Hardware maintenance WMCS Includes DC Ops interactions
Hardware decommissioning WMCS Includes DC Ops interactions
Monitoring - new hosts Data Persistence Zarcillo, Tendril. When Orchestrator is in place this will be automated.
Monitoring - DB generic Data Persistence
Monitoring - CloudDB-specific WMCS
OS install verification against spec WMCS
Puppet updates - DB generic Data Persistence
Puppet updates - CloudDB-specific WMCS
Data migration Data Persistence Populating new hosts with data
Replication set up Data Persistence
Alarm triage WMCS
Alarm adjustments WMCS
OS-level troubleshooting WMCS
DB-level troubleshooting WMCS, escalate to Data Persistence if needed
Schema updates Data Persistence Through replication
View creation / updates Data Engineering Data Engineering assisting with defining views and what data should be redacted
Applying view changes Data Platforms SRE Apply changes to views to the various wiki replicas
Performance tuning WMCS, escalate to Data Persistence if needed
Backup / Restore Data Persistence Restore from prod via Sanitarium
Removing PII data Data Persistence WMCS cc'd on the alert
OS updates (to level tested in production) WMCS, coordinated with Data Persistence Coordination to ensure we're running compatible versions
DB version updates (to level tested in production) Data Persistence, coordinated with WMCS Coordination to ensure we're running compatible versions

Admin guide

Docs to perform common tasks related to this service. As detailed as possible.

Adding a wiki

User connection limits

There are two important aspects to this.

  1. The review process, which is that requests for expanding the number of connections per tool be discussed at the weekly WMCS planning meeting for approval with a requirement of a +1 from a member of the DBA team after WMCS approval. This should be a restrictive process that errs on the side of denial. Only tools with a pretty good reason and significant user base should use additional connections beyond the default limit of 10.
  2. Keeping a record of the change in modules/role/templates/mariadb/grants/wiki-replicas.sql. In the file as of this writing, line 38 provides a model for a comment (including the Phabricator task number and user) as well as the exact command that is run on the replicas. For consistency, the command should be run on all four replica hosts.

The process after approval is:

  • Determine the mysql username of the tool, which should be in any approved request. This will be something similar to s52788 and not something like tools.flubber. It can be found in the tool's $HOME/replica.my.cnf.
  • The clinic-duty person logs into the root mysql shell for each replica and runs GRANT USAGE ON *.* TO '$mysql_username'@'%' WITH MAX_USER_CONNECTIONS $number; Please note that you can always view grants on the server you are on with SHOW GRANTS FOR '$mysql_username';
  • Record the change in modules/role/templates/mariadb/grants/wiki-replicas.sql
  • To persist the change on rebuilds, add the new value to hieradata/role/common/wmcs/nfs/primary.yaml
  • Communicate with the user that they should be all set!

De-pooling Replica Servers

See Portal:Data Services/Admin/Runbooks/Depool wikireplicas

Updating views

See Portal:Data Services/Admin/Runbooks/Deploy wiki replicas view change

Update a wiki's section

Although it is a rare operation, sometimes production wikis get moved from one section to a different one. For example, if two wikis get moved from s3 to s5, the following steps need to taken on the wikireplicas:

  • Update modules/profile/files/openstack/base/pdns/recursor/labsdb.zone, follow this example: https://gerrit.wikimedia.org/r/619627
  • Run /usr/local/sbin/maintain-meta_p on each wikireplica host
    • Double check that the slices look good and the new wikis are on the correct slice, in this case, s5:
sql meta_p
select dbname, slice from wiki where dbname in ('muswiki', 'mhwiktionary');
  • From a cloudcontrol node (like cloudcontrol1004.wikimedia.org) run:
wmcs-wikireplica-dns --aliases --shard s3
wmcs-wikireplica-dns --aliases --shard s5

History

Before September 2017 we had a cluster composed of three physical servers: labsdb1001, labsdb1002 and labsdb1003. Each host contained a copy of all database sections combined into a single database service which also allowed users to add their own custom tables to each wiki's namespace. DNS names for connecting to these servers used a ${wikidb}.labsdb naming convention. These legacy service names survive as CNAME pointers to the appropriate current (2020) s${SECTION_NUMBER}.analytics.db.svc.wikimedia.cloud name. Records are only maintained for wikis which existed at the time of the migration to the 2017 system.

In 2017 we deployed a new cluster of three physical servers with faster hardware, improved replication, and new service names. Each host still contained a merged view of all database sections under a single MariaDB process. User created tables were not allowed in this redesign to improve the ability to shift request traffic from one instance to another. This redesign also introduced the "web" vs "analytics" split for performance tuning. Read more details in the New Wiki Replica servers ready for use blog post from the launch announcement.

In 2021 we deployed the "2020" cluster of 8 physical servers with a new "multiinstance" design. Multiinstance separates database sections into distinct MariaDB instances with two instances hosted on each physical server. See News/Wiki Replicas 2020 Redesign for more information about the user facing changes this required.

In 2023-2024 we updated the load balancing layer to the cloudlb setup, previously there were lots of more layers and manual steps involved.

See also