
Help:Wiki Replicas

This page can be improved by moving in content from other docs. See phab:T232404. Contributions welcome!

Wiki Replicas are databases containing a sanitized version of Wikimedia's production databases.

This service is available to:

  • Toolforge tools
  • other Cloud VPS projects
  • Toolforge users connecting from their own computers through an SSH tunnel (see below)

The Wiki Replicas are part of the Data Services provided by Wikimedia Cloud Services.

Connecting to the database replicas

You can connect to the database replicas (and/or the cluster where a database replica is hosted) by specifying your access credentials and the alias of the cluster and replicated database. The location of these credentials varies between the services that have access to the Wiki Replicas; see the documentation for the specific service you are using for more details.

Naming conventions

As a convenience, each MediaWiki project database (enwiki, bgwiki, etc.) has an alias to the cluster it is hosted on. The alias has the form:

${PROJECT}.{analytics,web}.db.svc.wikimedia.cloud

where ${PROJECT} is the internal database name of a hosted Wikimedia project.

analytics vs web

The choice of "analytics" or "web" is up to you. The analytics service name connects to Wiki Replica servers where SQL queries are allowed to run for a longer duration (currently 3 hours instead of 5 minutes),[1][2] but at the cost of all queries being potentially slower. The web service name should be reserved for webservices running queries whose results are displayed directly to users.

Language codes and project families

Wikipedia project database names generally follow the format ${LANGUAGE_CODE}${PROJECT_FAMILY}. ${LANGUAGE_CODE} is the ISO 639 two-letter code for the primary content language (e.g. en for English, es for Spanish, bg for Bulgarian, ...). ${PROJECT_FAMILY} is an internal label for the wiki's project family (e.g. wiki for Wikipedia, wiktionary for Wiktionary, ...). Some wikis such as Meta-Wiki have database names that do not follow this pattern (metawiki). The full mapping of wikis to database names is available via the db-names Toolforge tool.

The replica database names themselves consist of the Wikimedia project database name suffixed with _p (an underscore and a p), for example:

enwiki_p for the English Wikipedia replica
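
Putting these conventions together, a connection from a Toolforge tool might look like the following sketch. It uses the pymysql Python library, which is one common choice but is not mandated; the user and password come from the $HOME/replica.my.cnf credentials file described later on this page.

import os
import pymysql

# Connect to the English Wikipedia replica via the "analytics" service
# name; credentials are read from the Toolforge-provided options file
# instead of being hard-coded.
conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",
    database="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
)
with conn.cursor() as cur:
    cur.execute("SELECT page_title FROM page WHERE page_namespace = 0 LIMIT 3")
    for (title,) in cur.fetchall():
        # page_title is stored as binary; decode it for display.
        print(title.decode("utf-8"))
conn.close()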

Sections

In addition, each cluster can be accessed by the name of its Wikimedia production section, which follows the format s${SECTION_NUMBER}.{analytics,web}.db.svc.wikimedia.cloud (for example, s1.analytics.db.svc.wikimedia.cloud hosts the enwiki_p database). The section where a particular database is hosted can change over time. You should only use the section name for opening a database connection if your application requires it for specific performance reasons, such as heavily crosswiki tools which would otherwise open hundreds of database connections; see the sketch below.
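
For illustration, a crosswiki tool could open one connection per section and address each wiki's tables with fully qualified names, along these lines. This is a pymysql sketch; the section-to-wiki mapping shown is a hypothetical example and should really be derived from the meta_p.wiki table described below.

import os
import pymysql

# Hypothetical mapping; in a real tool, look up each wiki's section
# ("slice") in meta_p.wiki instead of hard-coding it.
wikis_by_section = {"s1": ["enwiki_p"], "s5": ["dewiki_p"]}

for section, databases in wikis_by_section.items():
    conn = pymysql.connect(
        host=f"{section}.analytics.db.svc.wikimedia.cloud",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
    )
    with conn.cursor() as cur:
        for db in databases:
            # One connection per section serves every wiki hosted there.
            cur.execute(f"SELECT ss_total_pages FROM {db}.site_stats")
            print(section, db, cur.fetchone()[0])
    conn.close()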

Old names

You may find outdated documentation that uses *.labsdb aliases (for example enwiki.labsdb) to refer to the Wiki Replica databases. These service names are deprecated and have not had new wikis added since January 2018. Please update the docs or code that you find these references in to use the ${PROJECT}.{analytics,web}.db.svc.wikimedia.cloud naming convention.
You may find outdated documentation that uses ${project}.{analytics,web}.db.svc.eqiad.wmflabs aliases (for example enwiki.web.db.svc.eqiad.wmflabs) to refer to the Wiki Replica databases. These service names are deprecated. Please update the docs or code that you find these references in to use the ${PROJECT}.{analytics,web}.db.svc.wikimedia.cloud naming convention.

Connection handling policy

Usage of connection pools (maintaining open connections without them being in use), persistent connections, or any kind of connection pattern that maintains several connections open even if they are unused is not permitted on shared MariaDB instances (Wiki Replicas and ToolsDB).

The memory and processing power available to the database servers is a finite resource. Each open connection to a database, even if inactive, consumes some of these resources. Given the number of potential users for the Wiki Replicas and ToolsDB, if even a relatively small percentage of users held open idle connections, the servers would quickly run out of resources to allow new connections. Please close your connections as soon as you stop using them. Note that connecting interactively and being idle for a few minutes is not an issue; opening dozens of connections and keeping them open automatically is.

Idle connections can and will be killed by database and system administrators when discovered. If you (for example, by connector configuration or application policy) then reopen those connections automatically and keep them idle, you will be warned to stop.
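
In practice this means opening a connection, running your queries, and closing the connection as soon as the results are fetched. A minimal pattern, as a pymysql sketch (the run_query helper name is ours, not part of any library):

import os
import pymysql

def run_query(sql, params=None):
    """Open a connection, run a single query, and close the connection.

    No pooling, no persistent connections: the connection lives only
    as long as the query, per the policy above.
    """
    conn = pymysql.connect(
        host="enwiki.analytics.db.svc.wikimedia.cloud",
        database="enwiki_p",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
    )
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        conn.close()  # release server resources immediately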

Connecting to the wiki replicas from other Cloud VPS projects

The *.{analytics,web}.db.svc.wikimedia.cloud servers should be directly accessible from other Cloud VPS projects as well as Toolforge (these are provided in DNS), but there is no automatic creation of database credential files. The easiest way to get user credentials for use in another project is to create a Toolforge tool account and copy its credentials to your Cloud VPS instance.
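
Once copied, the file is a standard MySQL options file, so you can hand it to your client library (as in the sketches above) or parse it yourself. For example, a sketch; the path shown is hypothetical and is whatever location you copied the file to on your instance:

import configparser

# replica.my.cnf is an INI-style options file with a [client] section
# holding "user" and "password"; the values are wrapped in single quotes.
cfg = configparser.ConfigParser()
cfg.read("/home/example/replica.my.cnf")  # hypothetical path on your instance
user = cfg["client"]["user"].strip("'")
password = cfg["client"]["password"].strip("'")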

Connecting to the database replicas from your own computer

The Wiki Replicas are not publicly accessible from the internet.[3] However, Toolforge users can access the replicas from their own computers by setting up an SSH tunnel. If you use MySQL Workbench, you can find a detailed description for that application below.

Tunneling is a built-in capability of ssh. It allows creating a listening TCP port on your local computer that will transparently forward all connections to a given host and port on the remote side of the ssh connection. The destination host and port do not need to be the host that you are connecting to with your ssh session, but they do need to be reachable from the remote host.

In the general case, you need to add a port-forwarding rule in your SSH tool. Windows 10 includes OpenSSH, so the ssh command can be used directly. On older versions of Windows, you can use PuTTY by adding, under Connection → SSH → Tunnels, the settings shown in the dialog box at right.

PuTTY Tunnels Configuration

On Linux or Windows 10, you can add the option -L $LOCAL_PORT:$REMOTE_HOST:$REMOTE_PORT to your ssh call, e.g.:

$ ssh -L 3306:enwiki.analytics.db.svc.wikimedia.cloud:3306 yourusername@login.toolforge.org

This will set up a tunnel so that connections to port 3306 on your own computer will be relayed to the enwiki.analytics.db.svc.wikimedia.cloud database replica's MariaDB server on port 3306. This tunnel will continue to work as long as the SSH session is open.

The mariadb command line to connect using the tunnel from the example above would look something like:

$ mariadb --user=$USER_FROM_REPLICA.MY.CNF --host=127.0.0.1 --port=3306 --password enwiki_p

The user and password values needed can be found in the $HOME/replica.my.cnf credentials file for your Toolforge user account or a tool that you have access to.

Note that you need to explicitly use the 127.0.0.1 IP address; using localhost instead will give an error, as the client will try to connect over a Unix socket, which will not work.

SSH tunneling for local testing that uses Wiki Replica databases

  1. Set up an SSH tunnel: ssh -N yourusername@dev.toolforge.org -L 3306:enwiki.analytics.db.svc.wikimedia.cloud:3306
    • -N prevents ssh from opening an interactive shell. This connection will only be useful for port forwarding.
    • The first port is the listening port on your machine and the second one is on the remote server. 3306 is the default port for MySQL.
    • For multiple database connections, add additional -L $LOCAL_PORT:$REMOTE_HOST:$REMOTE_PORT sections to the same command or open additional ssh connections.
    • If you need to connect to more than one Wiki Replica database server, each database will need a different listening port on your machine (e.g. 3307, 3308, 3309, ...). Change the associated php/python connect command to send requests to that port instead of the default 3306.
  2. (optional) Edit your /etc/hosts file to add something like 127.0.0.1 enwiki.analytics.db.svc.wikimedia.cloud for each of the databases you're connecting to.
  3. You might need to copy the replica.my.cnf file from Toolforge to your local machine for this to work; a connection sketch follows below.
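
With the tunnel from step 1 running, your locally running code connects to 127.0.0.1 instead of the replica hostname. For example, a pymysql sketch assuming a local copy of replica.my.cnf as in step 3:

import os
import pymysql

conn = pymysql.connect(
    host="127.0.0.1",  # not "localhost": that would attempt a Unix socket
    port=3306,         # the local listening port chosen in step 1
    database="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
)
with conn.cursor() as cur:
    cur.execute("SELECT DATABASE()")
    print(cur.fetchone()[0])  # prints: enwiki_p
conn.close()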

TLS connection failures

Some client libraries may attempt to enable TLS encryption when connecting to the Wiki Replica or ToolsDB databases. Depending on the backing server's configuration, this may either fail silently because TLS is not supported at all, or it may fail with authentication or decryption errors because TLS is partially enabled. In this second case, the problem is caused by MariaDB servers which do support TLS encryption but are using self-signed certificates which are not available to the client and do not match the service names used for connections from Cloud Services hosts.

The "fix" for these failures is to configure your client to avoid TLS encryption. How to do this will vary based on the client libraries in use, but should be something that you can find an answer for by searching the Internet/Stack Overflow/library documentation.

Connecting with...

MySQL Workbench

If you are using an ed25519 key with a passphrase, you might have issues configuring this; see the MySQL bug. Consider establishing a separate SSH tunnel outside of MySQL Workbench, then using MySQL Workbench with connection method "Standard (TCP/IP)" and hostname 127.0.0.1, leaving the other credentials unchanged.
Example configuration of MySQL Workbench for Toolforge

You can connect to databases on Toolforge with MySQL Workbench (or similar client applications) via an SSH tunnel.

Instructions for connecting via MySQL Workbench are as follows:

  1. Launch MySQL Workbench on your local machine.
  2. Click the plus icon next to "MySQL Connections" in the Workbench window (or choose "Manage Connections..." from the Database menu and click the "new" button).
  3. Set Connection Method to "Standard TCP/IP over SSH"
  4. Set the following connection parameters:
    • SSH Hostname: login.toolforge.org
    • SSH Username: <your Toolforge shell username>
    • SSH Key File: <your Toolforge SSH private key file>[4]
    • SSH Password: password/passphrase of your private key (if set) - not your wiki login password.
    • MySQL Hostname: enwiki.analytics.db.svc.wikimedia.cloud (or whatever server your database lives on)
    • MySQL Server Port: 3306
    • Username: <your Toolforge MariaDB user name (from $HOME/replica.my.cnf)>
    • Password: <your Toolforge MariaDB password (from $HOME/replica.my.cnf)>
    • Default Schema: <name of the database, e.g. enwiki_p>
  5. Click "OK"

If you are using SSH keys generated with PuTTYgen (Windows users), you need to convert your private key to the 'OpenSSH' format. Load your private key in PuTTYgen, then click Conversions » Export OpenSSH key. Use this file as SSH Key File above.

If you are getting errors with SSL, you can try disabling it. From the menu bar: Database → select your connection → SSL → change "Use SSL" to "No".

Databases

Replica database schema (tables and indexes)

The database replicas for the various Wikimedia projects follow the standard MediaWiki database schema described on mediawiki.org and in the MediaWiki git repository.

Many of the indexes on these tables are actually compound indexes designed to optimize the runtime performance of the MediaWiki software rather than to be convenient for ad hoc queries. For example, a naive query by page_title such as

SELECT * FROM page WHERE page_title = 'NOFX';

will be slow, because the only index covering page_title is a compound index that starts with page_namespace. Adding page_namespace to the WHERE clause improves the query speed dramatically:

SELECT * FROM page WHERE page_namespace = 0 AND page_title = 'NOFX';

Stability of the MediaWiki database schema

sql/mysql/tables-generated.sql shows the current HEAD of the MediaWiki schema. Extra tables may be available due to additional extensions set up in production. Some tables may also have been redacted or filtered because they contain private data, such as user passwords or private IP addresses. Aside from that, while we try to keep production synchronized with the development HEAD, changes to the database structure may be applied in advance or (more commonly) lag behind their publication. The reason for this is that schema changes are continuously applied to production databases, and due to the amount of data involved, they may take anywhere from a few hours to a few months (for the more complex cases) to be finalized.

Core tables such as revision, page, user and recentchanges rarely change, but the cloud maintainers cannot guarantee they will never change, as they have to follow the production changes. While we are happy for people to set up scripts and tools on top of the database copies (Wiki Replicas), expect the schema to change every now and then. If you cannot make small tweaks from time to time to adapt to the latest schema changes, using the API instead of the database internals is suggested, as API changes have stronger stability guarantees and a proper lifecycle and deprecation policy. That is not true for MediaWiki database internals, although compatibility views can sometimes be set up to require only minimal changes.

Tables for revision or logging queries involving user names and IDs

The revision and logging tables do not have indexes on user columns. In an email, one of the system administrators pointed out that this is because "those values are conditionally nulled when suppressed" (see also phab:T68786 for more detail). One has to instead use the corresponding revision_userindex or logging_userindex views for these types of queries. On those views, rows where the column would otherwise have been nulled are elided; this allows the indexes to be usable.

Example query that will use the appropriate index (in this case on the rev_actor column):

SELECT rev_id, rev_timestamp FROM revision_userindex WHERE rev_actor=1234;

Example query that fails to use an index because the underlying table does not have one on rev_actor:

SELECT rev_id, rev_timestamp FROM revision WHERE rev_actor=1234;

Use the *_userindex views whenever you filter on user columns, so that your queries can use an index and run much faster.

Redacted tables

The majority of the user_properties table has been deemed sensitive and removed from the Wiki Replica databases. Only the disableemail, fancysig, gender, and nickname properties are available.

Unavailable tables

Some of the standard MediaWiki tables that are in use on Wikimedia wikis are not available. The following tables are missing or empty:

Metadata database

There is a table with automatically maintained meta information about the replicated databases: meta_p.wiki. See toolforge:db-names for a web-based list.

The database host containing the meta_p database is: meta.analytics.db.svc.wikimedia.cloud.

MariaDB [meta_p]> DESCRIBE wiki;
+------------------+--------------+------+-----+---------+-------+
| Field            | Type         | Null | Key | Default | Extra |
+------------------+--------------+------+-----+---------+-------+
| dbname           | varchar(32)  | NO   | PRI | NULL    |       |
| lang             | varchar(12)  | NO   |     | en      |       |
| name             | text         | YES  |     | NULL    |       |
| family           | text         | YES  |     | NULL    |       |
| url              | text         | YES  |     | NULL    |       |
| size             | decimal(1,0) | NO   |     | 1       |       |
| slice            | text         | NO   |     | NULL    |       |
| is_closed        | decimal(1,0) | NO   |     | 0       |       |
| has_echo         | decimal(1,0) | NO   |     | 0       |       |
| has_flaggedrevs  | decimal(1,0) | NO   |     | 0       |       |
| has_visualeditor | decimal(1,0) | NO   |     | 0       |       |
| has_wikidata     | decimal(1,0) | NO   |     | 0       |       |
| is_sensitive     | decimal(1,0) | NO   |     | 0       |       |
+------------------+--------------+------+-----+---------+-------+

Example data:

MariaDB [meta_p]> select * from wiki limit 1 \G
*************************** 1. row ***************************
          dbname: aawiki
            lang: aa
            name: Wikipedia
          family: wikipedia
             url: https://aa.wikipedia.org
            size: 1
           slice: s3.labsdb
       is_closed: 1
        has_echo: 1
 has_flaggedrevs: 0
has_visualeditor: 1
    has_wikidata: 1
    is_sensitive: 0
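
A typical use of meta_p is looking up where a wiki lives and whether it is still open. For example, a pymysql sketch following the connection conventions above:

import os
import pymysql

conn = pymysql.connect(
    host="meta.analytics.db.svc.wikimedia.cloud",
    database="meta_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
)
with conn.cursor() as cur:
    cur.execute(
        "SELECT url, slice, is_closed FROM wiki WHERE dbname = %s",
        ("enwiki",),
    )
    url, slice_, is_closed = cur.fetchone()
    print(url, slice_, is_closed)
conn.close()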

Identifying lag

Extended replication lag (on the order of multiple days) can be an expected and unavoidable side effect of some types of production database maintenance (e.g. schema changes). When this cause is confirmed, expect wiki replicas to catch up automatically once the maintenance finishes.

If there is a network or Wiki Replica infrastructure problem, a production problem, maintenance (scheduled or unscheduled), excessive load, or production or user queries blocking the replication process, the Wiki Replicas can "lag" behind the production databases.

To identify lag, see the replag tool or run the following query yourself on the database host you are connected to:

(u3518@enwiki.analytics.db.svc.wikimedia.cloud) [heartbeat_p]> SELECT * FROM heartbeat;
+-------+----------------------------+--------+
| shard | last_updated               | lag    |
+-------+----------------------------+--------+
| s1    | 2018-01-09T22:47:05.001180 | 0.0000 |
| s2    | 2018-01-09T22:47:05.001190 | 0.0000 |
| s3    | 2018-01-09T22:47:05.001290 | 0.0000 |
| s4    | 2018-01-09T22:47:05.000570 | 0.0000 |
| s5    | 2018-01-09T22:47:05.000670 | 0.0000 |
| s6    | 2018-01-09T22:47:05.000760 | 0.0000 |
| s7    | 2018-01-09T22:47:05.000690 | 0.0000 |
| s8    | 2018-01-09T22:47:05.000600 | 0.0000 |
+-------+----------------------------+--------+
8 rows in set (0.00 sec)

This table is based on the pt-heartbeat tool rather than on SHOW MASTER STATUS. It produces very accurate results even if replication is broken, because it compares directly against the original master rather than the replica's direct master.

  • shard: s1–s8, one entry for each of the production masters. The wiki-to-section distribution can be seen at https://noc.wikimedia.org/db.php
  • last_updated: Every second, a row on the master is updated with the master's local timestamp. This column shows that value once replicated. Because the row is only updated once per second, the measurement has an error of [0, 1+] seconds.
  • lag: The difference between the current date and the last_updated column (timestampdiff(MICROSECOND,`heartbeat`.`heartbeat`.`ts`,utc_timestamp())/1000000.0). Again, because updates to this table only happen every second (the interval can vary on production), most of the decimals are meaningless.

To directly query the replication lag for a particular wiki, use requests like:

MariaDB [fawiki_p]> SELECT lag FROM heartbeat_p.heartbeat JOIN meta_p.wiki ON shard = SUBSTRING_INDEX(slice, ".", 1) WHERE dbname = 'fawiki';

+------+
| lag  |
+------+
|    0 |
+------+
1 row in set (0.09 sec)

Please note that a few seconds to a few minutes of lag is considered normal, due to the filtering process and the hops made before reaching the public hosts.

Administration

The Wiki Replicas are jointly managed by the following teams: WMCS, Data Persistence, Data Engineering and Data Platform SRE. See also: Who admins what.

Admin docs

Notes

  1. https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/wmcs/db/wikireplicas/web_multiinstance.yaml#10
  2. https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/wmcs/db/wikireplicas/analytics_multiinstance.yaml#10
  3. phabricator:T318191
  4. If your private key is in a RFC4716 format, you will have to convert it to a PEM key.

See also

Communication and support

Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia movement volunteers. Please reach out with questions and join the conversation:

  • Discuss and receive general support
  • Stay aware of critical changes and plans
  • Track work tasks and report bugs
    • Use a subproject of the #Cloud-Services Phabricator project to track confirmed bug reports and feature requests about the Cloud Services infrastructure itself
  • Read stories and WMCS blog posts
    • Read the Cloud Services Blog (for the broader Wikimedia movement, see the Wikimedia Technical Blog)