Dumps/Dump servers

XML Dump servers

See also Portal:Data_Services/Admin/Dumps

Hardware

We have two hosts:

  • clouddumps1001.wikimedia.org in eqiad, production, NFS server for WMF Cloud and the stats hosts:
    Hardware/OS: Dell PowerEdge R740xd2, Debian 11 (bullseye), 512GB RAM, two 12-core Intel Xeon Silver 4214 2.2 GHz CPUs
    Disks: 2 internal 480 GB SSDs for the OS in RAID 1, 24 x 18 TB drives in RAID 10 for dumps
  • clouddumps1002.wikimedia.org in eqiad, production, web server and rsync to public mirrors:
    Hardware/OS: Dell PowerEdge R740xd2, Debian 11 (bullseye), 512GB RAM, two 12-core Intel Xeon Silver 4214 2.2 GHz CPUs
    Disks: 2 internal 480 GB SSDs for the OS in RAID 1, 24 x 18 TB drives in RAID 10 for dumps

Note that these hosts also serve other public datasets such as some Picture of the Year (POTY) files, the pagecount stats, etc.

Services

The active host serves dump files and other public data sets to the public using nginx; it also acts as an rsync server to our mirrors and to labs.
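
A quick way to sanity-check the public-facing services from any machine (a sketch only; the rsync daemon may refuse hosts that are not registered mirrors):

  # confirm the web service answers (nginx on the active host)
  curl -sI https://dumps.wikimedia.org/ | head -n 5
  # list the rsync modules offered to mirrors; may be refused for non-mirror IPs
  rsync rsync://dumps.wikimedia.org/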

Deploying a new host

You'll need to set up the RAID arrays by hand. We typically have two arrays: set up two RAID 10 arrays and combine them with LVM into one giant 64T volume, formatted ext4.
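
A minimal sketch of that layout, assuming the 24 data disks show up as /dev/sdc through /dev/sdz; the md device numbers and the "data" volume group/LV names are placeholders and should be checked against an existing host:

  # two RAID 10 arrays of 12 disks each (device names are placeholders)
  mdadm --create /dev/md1 --level=10 --raid-devices=12 /dev/sd[c-n]
  mdadm --create /dev/md2 --level=10 --raid-devices=12 /dev/sd[o-z]
  # combine them with LVM into a single large ext4 volume
  pvcreate /dev/md1 /dev/md2
  vgcreate data /dev/md1 /dev/md2
  lvcreate -l 100%FREE -n data data
  mkfs.ext4 /dev/data/data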

Install in the usual way: add the host to puppet (copying a pre-existing production labstorexxx host stanza), set everything up for PXE boot, and go. Depending on what the new box is going to do, choose the appropriate role (web/rsync, or NFS work), or combine profiles to create a new role.
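
The existing node stanzas and role assignments can be cribbed from the puppet repo; assuming they live in manifests/site.pp like those of other hosts, something like this will find them:

  git clone https://gerrit.wikimedia.org/r/operations/puppet
  grep -n -A 3 'clouddumps' puppet/manifests/site.pp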

Space issues

If we run low on space, we can keep fewer rounds of XML dumps; this is controlled by /etc/dumps/xml_keeps.conf on each host, which is generated by puppet. The hosts where dumps are written as they are generated keep only a few rounds; the web servers and the like keep many more.

The class dumps::web::cleanups::xmldumps generates two lists of how many dump runs to keep: one for 'replica' hosts (i.e. the web servers), with larger keep numbers, and one for the generating hosts (the NFS servers where dumps are written during each run). The list $keep_replicas is the one you want to tweak; the number of dumps kept can be adjusted separately for the huge wikis (enwiki, wikidatawiki), the big wikis (dewiki, commonswiki, etc.), and the rest.
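
To check what puppet actually generated on a given host (remember that the keep counts themselves live in the puppet class, so that is where changes go):

  # the generated keep list; edit the puppet class, not this file
  cat /etc/dumps/xml_keeps.conf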

Maintenance/downtime

There isn't a well-defined procedure for extended downtime of either host.

Both hosts are intended to have the same data at all times (kept up to date via rsync on a timer), but between them they serve a mixture of NFS to internal clients, rsync to public mirrors, and the public web service (dumps.wikimedia.org). Currently, clouddumps1002 is the active server for dumps.wikimedia.org and for rsync to the public mirrors, as configured with a DNS CNAME.

But clouddumps1001 is currently referenced as the NFS source when reloading wikidata entities into WDQS, in this cookbook, so both hosts are in service. If we are dealing with extended downtime for one of these hosts, we can reconfigure things so that everything runs from a single server, but a quick reboot is generally fine to carry out. This is particularly true on the NFS side, since clients will simply block and wait for the service to come back. For the web side, one could update the DNS alias and serve dumps from clouddumps1001 for a few minutes, then revert; in practice a couple of minutes of downtime for that service has been acceptable.
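
A couple of quick checks before a reboot or switchover (hostnames as above; exact NFS mount paths on clients will vary):

  # which host does the public alias currently point at?
  dig +short dumps.wikimedia.org CNAME
  # on an NFS client, which clouddumps host is mounted?
  mount | grep clouddumps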