Portal:Data Services/Admin/Runbooks/Resync a drbd volume
Overview
As of 00:40, 15 July 2021 (UTC), there are two read/write NFS clusters:
- primary (tools and all home/project dirs other than maps) - labstore1004 and labstore1005
- secondary (maps home/project dirs and scratch) - cloudstore1008 and cloudstore1009
Both are defined as clusters via DRBD in Puppet. If replication is badly interrupted, or the standby server is suspected of corruption (evidenced by LVM issues, disk problems, or similar), it can be necessary to forcibly resynchronize the pair by invalidating the standby server's copy of the data.
Error / Incident
If you cannot get DRBD replication in sync (sudo drbd-overview never shows the two in sync and replication shows errors), or if you are getting corrupted volumes after backups (which could also be caused by other errors on the backup side), you may want to try this operation. If you find yourself doing this often, you probably need to fix the standby server or the DAC network connection between the active and standby, because this should be a rare operation. It has been done once in three years, for instance.
Process
Disable alerts
Downtime DRBD status alerts. No need to upset everyone.
If this is the primary cluster, disable backups
cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet run backups against labstore1005 only. This should only be done when labstore1005 is the standby server, so we disable it during failovers. Check if the backup is running with systemctl status commands against the services mentioned below.
On cloudbackup2001.codfw.wmnet:
sudo -i puppet agent --disable "<myname>: failing over nfs primary cluster for maintenance"
sudo systemctl disable block_sync-tools-project.service
On cloudbackup2002.codfw.wmnet:
sudo -i puppet agent --disable "<myname>: failing over nfs primary cluster for maintenance"
sudo systemctl disable block_sync-misc-project.service
If the backups are currently running, make a call whether to stop the backup with systemctl (it only runs weekly) or let it finish before proceeding.
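As a rough aid for that decision, the state of each backup unit can be read with systemctl is-active and classified. This is only a sketch: the helper name backup_state_is_running is hypothetical, and treating "activating" as running assumes the block_sync units are oneshot-style services, which this page does not confirm.

```shell
# Hypothetical helper: decide from a systemd state string whether a backup
# is in progress. "activating" is what a oneshot unit reports while its
# command is still running (assumption about how block_sync units are set up).
backup_state_is_running() {
    case "$1" in
        active|activating) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use on cloudbackup2001.codfw.wmnet (not executed here):
#   state="$(systemctl is-active block_sync-tools-project.service)"
#   if backup_state_is_running "$state"; then
#       echo "backup in progress: stop it or wait"
#   fi
```

The same check applies to block_sync-misc-project.service on cloudbackup2002.codfw.wmnet.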
Invalidate the DRBD Secondary
In general, for monitoring DRBD while you work, this command is nice:
sudo drbd-overview
A good result looks like:
[bstorm@labstore1004]:~ $ sudo drbd-overview
1:test/0 Connected Primary/Secondary UpToDate/UpToDate /srv/test ext4 9.8G 535M 8.7G 6%
3:misc/0 Connected Primary/Secondary UpToDate/UpToDate /srv/misc ext4 5.0T 1.8T 3.0T 38%
4:tools/0 Connected Primary/Secondary UpToDate/UpToDate /srv/tools ext4 8.0T 5.7T 2.0T 75%
Invalidating the secondary is as simple as running the following ON THE STANDBY HOST:
sudo drbdadm invalidate all
This declares the disks there to be in need of overwriting from scratch. drbdadm will hopefully refuse to run this on the primary/active host, but do not rely on that: never run it on the current active host under any circumstances.
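One way to reduce the risk of invalidating the wrong host is to check the local DRBD role first. The sketch below is an illustration, not part of the documented procedure: the helper name local_role_is_secondary is hypothetical, and it assumes drbdadm role prints the local role first (as "Secondary/Primary" on DRBD 8.x, or just "Secondary" on DRBD 9). The resource names are taken from the sample drbd-overview output above.

```shell
# Hypothetical helper: succeeds only when the local role is Secondary.
# Accepts either "local/peer" (e.g. "Secondary/Primary") or a bare role.
local_role_is_secondary() {
    role="${1%%/*}"   # keep only the part before the first "/"
    [ "$role" = "Secondary" ]
}

# Example use on the host you believe is the standby (not executed here):
#   for res in test misc tools; do
#       if ! local_role_is_secondary "$(sudo drbdadm role "$res")"; then
#           echo "refusing: $res is not Secondary on this host" >&2
#           exit 1
#       fi
#   done
#   sudo drbdadm invalidate all
```

This is a convenience check only and does not replace confirming by hand which host is currently active.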
Wait for things to sync back up (this will take a while and will slow down writes for a bit).
If all volumes show UpToDate/UpToDate, you should be good to go.
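The final check can be scripted rather than eyeballed. The following is a minimal sketch, assuming drbd-overview prints one line per resource in the format shown earlier; the function name all_uptodate is hypothetical.

```shell
# Hypothetical helper: reads drbd-overview output on stdin and succeeds only
# if at least one resource line is present and every non-blank line reports
# UpToDate/UpToDate.
all_uptodate() {
    awk '
        NF == 0 { next }                         # skip blank lines
        { seen = 1 }                             # saw at least one resource
        $0 !~ /UpToDate\/UpToDate/ { bad = 1 }   # any other disk state fails
        END { if (seen && !bad) exit 0; exit 1 }
    '
}

# Example use (not executed here):
#   sudo drbd-overview | all_uptodate && echo "all volumes in sync"
```

Empty input is treated as a failure, so a drbd-overview that prints nothing is not mistaken for a healthy cluster.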
Support contacts
Anyone doing this will almost certainly be on the Cloud Services team, so find a coworker!
Related information
- Portal:Data_Services/Admin/Shared_storage#Clusters
- https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/#ch-admin-manual