MariaDB/Decommissioning a DB Host
Prerequisites:
- SSH access to one of the cluster management hosts (cumin1002.eqiad.wmnet, cumin2002.codfw.wmnet) to depool the host and run the decommissioning script
- SSH access to puppetmaster1001.eqiad.wmnet to merge puppet changes
- Access to Pwstore
- Git repositories cloned to your host:
Decommissioning workflow:
Create a tracking ticket
- Create a decommission ticket with the following template: https://phabricator.wikimedia.org/maniphest/task/edit/form/52/
- If the host has hardware problems, say so in the ticket so that DC-Ops can label the broken parts and they are not reused.
Depool the host
- SSH to one of the cluster management hosts (cumin1002.eqiad.wmnet, cumin2002.codfw.wmnet) and run:
dbctl instance HOSTNAME depool && dbctl config commit -m "Depool HOSTNAME TASKNUMBER"
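The depool command above can be wrapped in a small shell function so the hostname and task number are typed only once. This is a minimal sketch: `depool_host`, `run` and the `DRY_RUN` switch are hypothetical helpers, not part of dbctl; with `DRY_RUN=1` the commands are only printed.

```shell
#!/bin/sh
# Minimal sketch; depool_host, run and DRY_RUN are hypothetical helpers.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Depool HOSTNAME and commit the change, citing TASKNUMBER in the message.
# Usage: depool_host HOSTNAME TASKNUMBER
depool_host() {
    host="$1"
    task="$2"
    # The commit only runs if the depool succeeded.
    run dbctl instance "$host" depool &&
        run dbctl config commit -m "Depool $host $task"
}
```

For example, `DRY_RUN=1 depool_host db1091 T267088` prints both commands without running them, which is a cheap way to check the placeholders were filled in.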
Remove the host from dbctl
- Create a puppet patch (example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/638343)
- SSH to puppetmaster1001
sudo puppet-merge
- If you see any changes other than yours, contact their owners to confirm they are OK to merge.
- SSH to one of the cluster management hosts (cumin1002.eqiad.wmnet, cumin2002.codfw.wmnet) and run:
sudo dbctl config commit -m "Remove HOSTNAME from dbctl TASKNUMBER"
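A common slip at this step is committing with a literal `TASKNUMBER` left in the message. The sketch below refuses to commit unless the task ID looks like a Phabricator ID such as T267088. The helpers `commit_dbctl_removal`, `is_task_id`, `run` and `DRY_RUN` are hypothetical, and the "T followed by digits" rule is an assumption based on the task IDs on this page.

```shell
#!/bin/sh
# Minimal sketch; all helper names and DRY_RUN are hypothetical.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Succeed only for IDs shaped like T267088: a "T" followed by digits.
is_task_id() {
    case "$1" in
        T | T*[!0-9]*) return 1 ;;
        T*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: commit_dbctl_removal HOSTNAME TASKNUMBER
# Run only after the puppet patch removing HOSTNAME has been merged.
commit_dbctl_removal() {
    host="$1"
    task="$2"
    if ! is_task_id "$task"; then
        printf 'error: "%s" is not a task ID such as T267088\n' "$task" >&2
        return 1
    fi
    run sudo dbctl config commit -m "Remove $host from dbctl $task"
}
```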
Remove all other puppet entries
- Create a puppet patch (example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/638352)
- DHCP changes are no longer needed, so there is no need to edit linux-host-entries.ttyS1-115200.
- DO NOT merge the patch yet; it is merged after the decommissioning script has run.
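To find the entries the patch needs to remove, it can help to search a local clone of operations/puppet for the hostname. This is a minimal sketch assuming the repository is cloned locally; `find_host_refs` is a hypothetical helper, and every match should still be reviewed by hand rather than removed blindly.

```shell
#!/bin/sh
# Minimal sketch; find_host_refs is a hypothetical helper.
# Usage: find_host_refs PUPPET_CHECKOUT HOSTNAME
# Prints file:line:text for every line that still mentions HOSTNAME.
find_host_refs() {
    repo="$1"
    host="$2"
    # -r: recurse; -n: line numbers; -F: fixed string, not a regex;
    # -w: whole words only, so db1091 does not also match db10910.
    # Skip git's own metadata directory.
    grep -rnwF --exclude-dir=.git -e "$host" "$repo"
}
```

The `-w` flag matters here: numbered hostnames share prefixes, and a plain substring search would also list entries for unrelated hosts.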
Run the decommissioning script
- SSH to one of the cluster management hosts (cumin1002.eqiad.wmnet, cumin2002.codfw.wmnet).
- Start a screen or tmux session and run (DC is the host's datacenter, e.g. eqiad or codfw):
sudo cookbook sre.hosts.decommission -t TASKNUMBER HOSTNAME.DC.wmnet
- When prompted, enter the console password from Pwstore.
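The cookbook call can be sketched as a function that builds HOSTNAME.DC.wmnet for you and warns if you forgot the screen or tmux session. This is a minimal sketch: `decommission_host`, `run` and `DRY_RUN` are hypothetical, and restricting DC to eqiad and codfw is an assumption based on the management hosts listed on this page. It relies on screen setting `$STY` and tmux setting `$TMUX`.

```shell
#!/bin/sh
# Minimal sketch; decommission_host, run and DRY_RUN are hypothetical.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: decommission_host HOSTNAME DC TASKNUMBER
decommission_host() {
    host="$1"
    dc="$2"
    task="$3"
    # Assumption: the DB hosts handled here live in eqiad or codfw.
    case "$dc" in
        eqiad | codfw) ;;
        *)
            printf 'error: unexpected datacenter "%s"\n' "$dc" >&2
            return 1 ;;
    esac
    # Warn, but do not fail, outside screen ($STY) or tmux ($TMUX).
    if [ -z "${STY:-}" ] && [ -z "${TMUX:-}" ]; then
        printf 'warning: not inside a screen or tmux session\n' >&2
    fi
    run sudo cookbook sre.hosts.decommission -t "$task" "$host.$dc.wmnet"
}
```

The cookbook still prompts for the console password interactively; the wrapper only assembles the arguments.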
Merge puppet change
- SSH to puppetmaster1001
sudo puppet-merge
- If you see any changes other than yours, contact their owners to confirm they are OK to merge.
Remove host from zarcillo
- Log the action in IRC (#wikimedia-operations):
!log Removing HOSTNAME from zarcillo TASKNUMBER
- SSH to one of the cluster management hosts (cumin1002.eqiad.wmnet, cumin2002.codfw.wmnet) and become root:
sudo -i
- Connect to the zarcillo database:
db-mysql db1215 -A zarcillo
- Run the following queries at the MySQL prompt (remember the trailing semicolons):
set binlog_format='ROW';
delete from servers where hostname like 'HOSTNAME%';
delete from instances where name like 'HOSTNAME%';
delete from section_instances where instance like 'HOSTNAME%';
(Instance names are normally HOSTNAME or HOSTNAME:PORT.)
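Because each `like 'HOSTNAME%'` pattern matches every name that starts with HOSTNAME, it is worth previewing the rows before deleting them. The sketch below prints the queries above for a given host, preceded by matching SELECTs (an addition, not part of the procedure), and rejects hostnames containing anything other than lowercase letters, digits and dashes so the value cannot break out of the SQL string or add LIKE wildcards. `zarcillo_cleanup_sql` is a hypothetical helper; paste its output into the db-mysql prompt and check the SELECT results before running the DELETEs.

```shell
#!/bin/sh
# Minimal sketch; zarcillo_cleanup_sql is a hypothetical helper.
# Usage: zarcillo_cleanup_sql HOSTNAME
zarcillo_cleanup_sql() {
    host="$1"
    # Allow only lowercase letters, digits and dashes: no quotes,
    # semicolons, or LIKE wildcards (% and _).
    case "$host" in
        '' | *[!a-z0-9-]*)
            printf 'error: invalid hostname "%s"\n' "$host" >&2
            return 1 ;;
    esac
    # Preview queries first, then the deletes in the documented order.
    cat <<EOF
set binlog_format='ROW';
select * from servers where hostname like '$host%';
select * from instances where name like '$host%';
select * from section_instances where instance like '$host%';
delete from servers where hostname like '$host%';
delete from instances where name like '$host%';
delete from section_instances where instance like '$host%';
EOF
}
```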
Remove host from orchestrator
Orchestrator purges the host automatically within 1-2 weeks, but to avoid that delay, remove it manually:
- From the GUI (admin users only)
- From the CLI:
- Log the action in IRC (#wikimedia-operations):
!log Removing HOSTNAME from orchestrator TASKNUMBER
- SSH to dborch1001.wikimedia.org
- Single-instance host (use the FQDN for HOSTNAME):
sudo orchestrator -c forget -i HOSTNAME:3306
- Multi-instance host (use the FQDN for HOSTNAME), once for each HOSTNAME:PORT combination:
sudo orchestrator -c forget -i HOSTNAME:PORT
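For multi-instance hosts the forget command has to be repeated per port. The sketch below loops over the ports you pass and refuses a short hostname, since orchestrator expects the FQDN here. `orchestrator_forget`, `run` and `DRY_RUN` are hypothetical helpers; pass 3306 alone for a single-instance host.

```shell
#!/bin/sh
# Minimal sketch; orchestrator_forget, run and DRY_RUN are hypothetical.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: orchestrator_forget FQDN PORT [PORT...]
orchestrator_forget() {
    if [ "$#" -lt 2 ]; then
        printf 'usage: orchestrator_forget FQDN PORT [PORT...]\n' >&2
        return 1
    fi
    fqdn="$1"
    shift
    # Guard against passing a short hostname instead of the FQDN.
    case "$fqdn" in
        *.*) ;;
        *)
            printf 'error: "%s" is not an FQDN\n' "$fqdn" >&2
            return 1 ;;
    esac
    # One forget per HOSTNAME:PORT combination; stop at the first failure.
    for port in "$@"; do
        run sudo orchestrator -c forget -i "$fqdn:$port" || return 1
    done
}
```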
Update the task and send it to DC-Ops
- Mark all the "step for service owners" steps as done on: https://phabricator.wikimedia.org/T267088
- Reassign:
- for eqiad to wiki_willy
- for codfw to wiki_willy
- Remove the #DBA tag and add #dc-ops plus #ops-eqiad or #ops-codfw, depending on the host's datacenter.
- Add the following comment: "This host is ready for DC-Ops to decommission".
This page is a part of the SRE Data Persistence technical documentation
(go here for a list of all our pages)