GitLab/Failover
If there is sufficient time before the maintenance, add it to next week's Tech News.
GitLab has an active host and one or more replicas. The replicas are currently cold standbys, meaning they don't serve any production traffic and hold data that may be up to 24 hours old. For maintenance, or in case of emergency, it is possible to fail over from the active host to a replica. This page describes the process broadly.
The process takes around 1 to 1.5 hours (depending on backup size). During that time GitLab is not available.
Prerequisites
The host to fail over to should be a proper GitLab replica, meaning it:
- has a second IPv4 and IPv6 address configured as profile::gitlab::service_ip_v4 and profile::gitlab::service_ip_v6
- is running the puppet role(gitlab)
- has enough disk space
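A quick sanity check of these prerequisites on the candidate replica (the backup path is an assumption based on the commands used later on this page):
ip -br addr show          # the service IPv4/IPv6 addresses should be configured on the host
df -h /srv/gitlab-backup  # needs enough free space for a full backup plus the restore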
Planned Failover
A planned failover means the old production instance is responding and working properly, so a recent backup can be created. There is no data loss. The following steps are needed to fail over to a new host.
Before failover
- make sure you have access to the new instance (should already be in place, because accounts and tokens are similar to the production instance):
- you can log in to the instance
- you have admin privileges and have created a personal access token
- apply gitlab-settings to new host (done for all replicas)
- announce downtime some days ahead on engineering-all, #wikimedia-gitlab
- create a GitLab broadcast message with information about the failover
- create a phabricator task to track the maintenance (like this one)
Changes to prepare in advance
- configure the new host with profile::gitlab::service_name: 'gitlab.wikimedia.org' (or another name if switching replicas, example change 802150)
- configure the new host in profile::gitlab::active_host (example change 802150, not needed for replicas)
- configure the DNS entry for gitlab.wikimedia.org to point to the new host (or another name if switching replicas, example change 802473)
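Before preparing the DNS change, it can help to record what the name currently resolves to:
dig +short gitlab.wikimedia.org A
dig +short gitlab.wikimedia.org AAAA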
During failover (using cookbook)
The cookbook handles all of the manual steps of the failover, except for merging any puppet or DNS changes. Make sure that the --switch-from and --switch-to hosts are correct. Running the cookbook with the --dry-run flag will show the steps it would perform without making any changes. The cookbook takes about 2 hours; ensure that it is run in tmux/screen.
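A minimal sketch of a dry run inside tmux (host names and the task ID are placeholders; --dry-run is the cookbook runner's global option, so check cookbook --help if it is not accepted in this position):
tmux new -s gitlab-failover
cookbook --dry-run sre.gitlab.failover --switch-from <current gitlab host> --switch-to <new gitlab host> -t Tzzzz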
- Start the GitLab failover cookbook on the cumin host with
cookbook sre.gitlab.failover --switch-from <current gitlab host> --switch-to <new gitlab host> -t Tzzzz
- Confirm that the intended migration is correct (e.g., that gitlab.wikimedia.org, gitlab-replica-a.wikimedia.org and gitlab-replica-b.wikimedia.org will end up on the correct machines)
- When prompted to, merge the change prepared earlier to set profile::gitlab::service_name: 'gitlab.wikimedia.org'
- When prompted to, merge the change prepared earlier for changing the DNS entries, then run authdns-update on the DNS master (following DNS#Changing records in a zonefile)
- Test that the new GitLab instance looks correct (see the smoke-test sketch below).
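A minimal smoke test after the cookbook finishes (the project path is a placeholder, substitute any public repository):
dig +short gitlab.wikimedia.org                                   # should now point at the new host
curl -sI https://gitlab.wikimedia.org/explore | head -1           # expect a 200 (or a redirect to sign-in)
git ls-remote https://gitlab.wikimedia.org/<some-public-project>.git HEAD   # read access over HTTPS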
During failover (manual steps)
- pause all GitLab Runners
- stop puppet on the old host with sudo disable-puppet "Failover in progress"
- stop write access on the old host by stopping nginx and ssh-gitlab with gitlab-ctl stop nginx and systemctl stop ssh-gitlab
- create a full backup on the old host:
/usr/bin/gitlab-backup create CRON=1 STRATEGY=copy GZIP_RSYNCABLE="true" GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1"
- sync the backup to the new host:
/usr/bin/rsync -avp /srv/gitlab-backup/ rsync://<NEW_HOST>.wikimedia.org/data-backup
- configure the new host with profile::gitlab::service_name: 'gitlab.wikimedia.org' (example change 802150)
- configure the new host in profile::gitlab::active_host (example change 802150)
- trigger a restore on the new host:
/srv/gitlab-backup/gitlab-restore.sh
- point the DNS entry for gitlab.wikimedia.org to the new host (example change 802473) and run authdns-update
- verify the installation (login, push, pull, look at metrics; see the verification sketch after this list)
- run puppet on new host
- enable puppet on the old host with sudo enable-puppet "Failover in progress"
- unpause all GitLab Runners
- announce end of downtime
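A minimal sketch of the push/pull verification (the project path is a placeholder; use a repository you have write access to):
git clone ssh://git@gitlab.wikimedia.org/<your-project>.git /tmp/failover-test   # pull over ssh-gitlab
cd /tmp/failover-test
git commit --allow-empty -m "failover write test" && git push                    # confirms writes work
# also look at the GitLab metrics/dashboards for errors after the switch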
After failover
- Make sure new instance works properly (login, trigger a CI job)
- Check that timers for backups are present on the production host
- Check that timers for restore are present on the replicas only (a quick check is sketched below)
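A quick way to see which backup/restore timers a host has (the exact timer names are managed by puppet and may differ):
systemctl list-timers --all | grep -iE 'backup|restore'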
Unplanned Failover
An unplanned failover means the old production instance is not responding or is lost, and it is not possible to create a recent backup. There may be up to 24 hours of data loss in GitLab.
Get as recent data as possible
Check the age of the backup in bacula and on the existing replicas (a quick check on a replica is sketched below). If the backup is reasonably recent, use it (make sure to check GitLab/Backup and Restore#Fetch backups from bacula). If that backup is too old, try to manually schedule a database dump and rsync the git repositories. However, this is not an automated step and needs more planning.
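A quick check of the backup age on a replica (the backup directory is the same one used elsewhere on this page):
ls -lht /srv/gitlab-backup/ | head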
During failover
The following steps assume that the old host is not available anymore and that a replica with the most recent ("latest") backup is used for the failover:
- configure the new host with profile::gitlab::service_name: 'gitlab.wikimedia.org' (example change 802150)
- configure the new host in profile::gitlab::active_host (example change 802150)
- if needed, trigger a restore on the new host with /srv/gitlab-backup/gitlab-restore.sh (not needed if a new backup can't be created)
- point the DNS entry for gitlab.wikimedia.org to the new host (example change 802473) and run authdns-update
- verify installation (login, push, pull, look at metrics)
- run puppet on new host
Aborting Failover/Rollback
Sometimes you will need to abort the failover while the work is in progress. These instructions do not apply once you have finalised the failover process; switching back at that point will require working through the full process again. Find the point that you need to start at, and work through the remainder of the instructions.
I've already merged some changes
After you merge changes, there are two possible situations. In both cases, the situation should be resolved before restarting GitLab and removing the deploy page.
I've merged the DNS change
This should be the least likely scenario, since all previous steps should have gone correctly. Aborting the failover at this point opens the possibility that writes to the new host have started, and may create a split-brain situation if you decide to switch back. In almost all cases, you will be better off leaving the host as it is, since this is the last step of the migration. If you do need to abort the migration at this point, you should STRONGLY consider following the full failover process. Be careful here. Data can be lost! If you're ok with potentially losing data, then revert your DNS change, and follow the two sections below (reverting the puppet change, and restoring access to the old gitlab instance).
I've merged the puppet change
If the DNS change has not been merged yet, then no changes will have been made to the new system, which means both hosts should still be in sync. In this case, revert your change that switches profile::gitlab::service_name and profile::gitlab::active_host, run puppet on both hosts (see the sketch below), and then proceed with restoring access to GitLab using the instructions below.
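A minimal sketch of the puppet runs after the revert is merged (run-puppet-agent is the usual WMF wrapper; a plain sudo puppet agent -t does the same):
# on both the old and the new host
sudo run-puppet-agent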
I haven't merged anything yet
Up to the point where you merge changes, rolling back is fairly simple, since we haven't changed anything on either host and have only gone as far as creating a backup:
- Restart GitLab on the host that will remain the primary:
sudo gitlab-ctl restart; sudo systemctl start ssh-gitlab; sudo gitlab-ctl deploy-page up
- Re-enable puppet on both hosts:
sudo enable-puppet "Failover in progress"
- Check if you need to remove any backup files that were created, in order to not fill up the disk
This page is a part of the SRE Collaboration Services technical documentation