GitLab/Failover
If there is sufficient time before the maintenance, add it to next week's Tech News.
GitLab has an active host and one or more replicas. The replicas are currently cold standbys, meaning they don't serve any production traffic and hold data that may be up to 24 hours old. For maintenance, or in case of emergency, it is possible to fail over from the active host to a replica. This page describes the process broadly.
The process takes around 1 to 1.5 hours (depending on backup size). During that time GitLab is not available.
Prerequisites
The host to fail over to should be a proper GitLab replica, meaning it:
- has a second IPv4 and IPv6 address configured as profile::gitlab::service_ip_v4 and profile::gitlab::service_ip_v6
- is running the puppet role(gitlab)
- has enough disk space
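A quick sanity check of these prerequisites on the candidate replica (the backup path is an assumption based on the commands used later on this page):
ip -br addr show          # the service IPv4/IPv6 addresses should be configured on the host
df -h /srv/gitlab-backup  # needs enough free space for a full backup plus the restore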
Planned Failover
A planned failover means the old production instance is responding and working properly, so a recent backup can be created. There is no data loss. The following steps are needed to fail over to a new host.
Before failover
- make sure you have access to the new instance (should already be in place, because accounts and tokens are similar to the production instance):
- you can log in to the instance
- you have admin privileges and have created a personal access token
- apply gitlab-settings to new host (done for all replicas)
- announce downtime some days ahead on engineering-all, #wikimedia-gitlab
- create a GitLab broadcast message with information about the failover
- create a phabricator task to track the maintenance (like this one)
Changes to prepare in advance
- configure the new host with profile::gitlab::service_name: 'gitlab.wikimedia.org' (or another name if switching replicas, example change 802150)
- configure the new host in profile::gitlab::active_host (example change 802150, not needed for replicas)
- configure the DNS entry for gitlab.wikimedia.org to point to the new host (or another name if switching replicas, example change 802473)
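Before preparing the DNS change, it can help to record what the name currently resolves to:
dig +short gitlab.wikimedia.org A
dig +short gitlab.wikimedia.org AAAA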
During failover (using cookbook)
The cookbook handles all of the manual steps of the failover, except for merging any puppet or DNS changes. Make sure that the --switch-from and --switch-to hosts are correct. Running the cookbook with the --dry-run flag will show the steps it would perform without making any changes. The cookbook takes about 2 hours; ensure that it is run in tmux/screen.
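A minimal sketch of a dry run inside tmux (host names and the task ID are placeholders; --dry-run is the cookbook runner's global option, so check cookbook --help if it is not accepted in this position):
tmux new -s gitlab-failover
cookbook --dry-run sre.gitlab.failover --switch-from <current gitlab host> --switch-to <new gitlab host> -t Tzzzz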
- Start the GitLab failover cookbook on the cumin host with
cookbook sre.gitlab.failover --switch-from <current gitlab host> --switch-to <new gitlab host> -t Tzzzz
- Confirm that the intended migration is correct (e.g., that gitlab.wikimedia.org, gitlab-replica-a.wikimedia.org and gitlab-replica-b.wikimedia.org will end up on the correct machines)
- When prompted to, merge the change prepared earlier to set profile::gitlab::service_name: 'gitlab.wikimedia.org'
- When prompted to, merge the change prepared earlier for changing the DNS entries, then run authdns-update on the DNS master (following DNS#Changing records in a zonefile)
- Test that the new GitLab instance looks correct (see the smoke-test sketch below).
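A minimal smoke test after the cookbook finishes (the project path is a placeholder, substitute any public repository):
dig +short gitlab.wikimedia.org                                   # should now point at the new host
curl -sI https://gitlab.wikimedia.org/explore | head -1           # expect a 200 (or a redirect to sign-in)
git ls-remote https://gitlab.wikimedia.org/<some-public-project>.git HEAD   # read access over HTTPS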
During failover (manual steps)
- pause all GitLab Runners
- stop puppet on the old host with sudo disable-puppet "Failover in progress"
- stop write access on the old host by stopping nginx and ssh-gitlab with gitlab-ctl stop nginx and systemctl stop ssh-gitlab
- create a full backup on the old host:
/usr/bin/gitlab-backup create CRON=1 STRATEGY=copy GZIP_RSYNCABLE="true" GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"
- sync the backup to the new host:
/usr/bin/rsync -avp /srv/gitlab-backup/ rsync://<NEW_HOST>.wikimedia.org/data-backup
- configure the new host with profile::gitlab::service_name: 'gitlab.wikimedia.org' (example change 802150)
- configure the new host in profile::gitlab::active_host (example change 802150)
- trigger a restore on the new host:
/srv/gitlab-backup/gitlab-restore.sh
- point the DNS entry for gitlab.wikimedia.org to the new host (example change 802473) and run authdns-update
- verify the installation (login, push, pull, look at metrics; see the verification sketch after this list)
- run puppet on new host
- enable puppet on the old host with sudo enable-puppet "Failover in progress"
- unpause all GitLab Runners
- announce end of downtime
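A minimal sketch of the push/pull verification (the project path is a placeholder; use a repository you have write access to):
git clone ssh://git@gitlab.wikimedia.org/<your-project>.git /tmp/failover-test   # pull over ssh-gitlab
cd /tmp/failover-test
git commit --allow-empty -m "failover write test" && git push                    # confirms writes work
# also look at the GitLab metrics/dashboards for errors after the switch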
After failover
- Make sure new instance works properly (login, trigger a CI job)
- Check that timers for backups are present on the production host
- Check that timers for restore are present on the replicas only (a quick check is sketched below)
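A quick way to see which backup/restore timers a host has (the exact timer names are managed by puppet and may differ):
systemctl list-timers --all | grep -iE 'backup|restore'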
Unplanned Failover
An unplanned failover means the old production instance is not responding or is lost, and it is not possible to create a recent backup. There may be up to 24 hours of data loss in GitLab.
Get as recent data as possible
Check the age of the backup in bacula and on the existing replicas (a quick check on a replica is sketched below). If the backup is reasonably recent, use it (make sure to check GitLab/Backup and Restore#Fetch backups from bacula). If that backup is too old, try to manually schedule a database dump and rsync the git repositories. However, this is not an automated step and needs more planning.
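A quick check of the backup age on a replica (the backup directory is the same one used elsewhere on this page):
ls -lht /srv/gitlab-backup/ | head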
During failover
The following steps assume that the old host is not available anymore and that a replica with the most recent ("latest") backup is used for the failover:
- configure the new host with profile::gitlab::service_name: 'gitlab.wikimedia.org' (example change 802150)
- configure the new host in profile::gitlab::active_host (example change 802150)
- if needed, trigger a restore on the new host with /srv/gitlab-backup/gitlab-restore.sh (not needed if a new backup can't be created)
- point the DNS entry for gitlab.wikimedia.org to the new host (example change 802473) and run authdns-update
- verify installation (login, push, pull, look at metrics)
- run puppet on new host
Aborting Failover/Rollback
Sometimes you will need to abort the failover while the work is in progress. These instructions do not apply once you have finalised the failover process; switching back at that point will require working through the full process again. Find the point that you need to start at, and work through the remainder of the instructions.
I've already merged some changes
After you merge changes, there are two possible situations. In both cases, the situation should be resolved before restarting GitLab and removing the deploy page.
I've merged the DNS change
This should be the least likely scenario, since all previous steps should have gone correctly. Aborting the failover at this point opens the possibility that writes to the new host have started, and may create a split-brain situation if you decide to switch back. In almost all cases, you will be better off leaving the host as it is, since this is the last step of the migration. If you do need to abort the migration at this point, you should STRONGLY consider following the full failover process. Be careful here. Data can be lost! If you're ok with potentially losing data, then revert your DNS change, and follow the two sections below (reverting the puppet change, and restoring access to the old gitlab instance).
I've merged the puppet change
If the DNS change has not been merged yet, then no changes will have been made to the new system, which means both hosts should still be in sync. In this case, revert your change that switches profile::gitlab::service_name and profile::gitlab::active_host, run puppet on both hosts (see the sketch below), and then proceed with restoring access to GitLab using the instructions below.
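A minimal sketch of the puppet runs after the revert is merged (run-puppet-agent is the usual WMF wrapper; a plain sudo puppet agent -t does the same):
# on both the old and the new host
sudo run-puppet-agent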
I haven't merged anything yet
Up to the point where you merge changes, rolling back is fairly simple, since we haven't changed anything on either host and have only gone as far as creating a backup:
- Restart GitLab on the host that will remain the primary:
sudo gitlab-ctl restart; sudo systemctl start ssh-gitlab; sudo gitlab-ctl deploy-page up
- Re-enable puppet on both hosts:
sudo enable-puppet "Failover in progress"
- Check if you need to remove any backup files that were created, in order to not fill up the disk
This page is a part of the SRE Collaboration Services technical documentation