
Portal:Data Services/Admin/Runbooks/Depool wikireplicas

The procedures in this runbook require admin permissions to complete.

Wiki replica servers occasionally need maintenance, fail, etc. This runbook details how to depool a wiki replica database server.

Depooling

The pool status of the wiki replica hosts is managed via conftool.

Each section (s1, s2, etc.) is pooled on two different hosts (one "web" host and one "analytics" host), so you can depool a section from one host and HAProxy (running on cloudlb100[12]) will redirect the traffic to the other host.

You should never depool the same section from both the "web" and "analytics" hosts at the same time.
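
Before depooling, you can check which two hosts currently serve a given section by running confctl on a cumin host (see the conftool section below); selecting by the service tag alone should match both the "web" and "analytics" entries:

taavi@cumin1002 ~ $ sudo confctl select service=s1 get
{"clouddb1013.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-web,service=s1"}
{"clouddb1017.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-analytics,service=s1"}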

When all hosts are pooled, the config is like this:

{"clouddb1013.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-web,service=s3"}
{"clouddb1013.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-web,service=s1"}
{"clouddb1014.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-web,service=s7"}
{"clouddb1014.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-web,service=s2"}
{"clouddb1015.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-web,service=s4"}
{"clouddb1015.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-web,service=s6"}
{"clouddb1016.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-web,service=s5"}
{"clouddb1016.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-web,service=s8"}
{"clouddb1017.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-analytics,service=s3"}
{"clouddb1017.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-analytics,service=s1"}
{"clouddb1018.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-analytics,service=s7"}
{"clouddb1018.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-analytics,service=s2"}
{"clouddb1019.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-analytics,service=s4"}
{"clouddb1019.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-analytics,service=s6"}
{"clouddb1020.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-analytics,service=s8"}
{"clouddb1020.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-analytics,service=s5"}

web vs analytics considerations

"web" and "analytics" servers are exactly the same, except they have a different timeout setting in /etc/default/wmf-pt-kill: "web" servers are configured to kill any queries taking more than 300 seconds (5 minutes), while "analytics" servers have a higher timeout of 10800 seconds (3 hours).

If you are depooling a "web" server, you don't need to do anything special, because all traffic will go to the "analytics" server, which has a higher timeout. However, if you are depooling an "analytics" server, traffic will go to the "web" server and users will suddenly experience a lower timeout for their "analytics" queries. To work around this, you can temporarily increase the timeout on the "web" server that remains pooled.

For example, if you are depooling clouddb1020, which is the "analytics" server for sections s5 and s8, you can increase the timeout on the corresponding "web" server for those sections (clouddb1016):

root@clouddb1016:~# vi /etc/default/wmf-pt-kill # set BUSY_TIME="10800"
root@clouddb1016:~# systemctl restart wmf-pt-kill@s5
root@clouddb1016:~# systemctl restart wmf-pt-kill@s8

# Verify that your change took effect:
# you should see '--busy-time 10800' in the parameters
root@clouddb1016:~# ps ax | grep wmf-pt-kill

# Puppet will revert your change to /etc/default/wmf-pt-kill
# but will NOT restart the systemd units.
# When you no longer need the higher timeout, remember to restart the units:
root@clouddb1016:~# systemctl restart wmf-pt-kill@s5
root@clouddb1016:~# systemctl restart wmf-pt-kill@s8
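
# If you'd rather restore the default immediately than wait for the next
# scheduled agent run, you can trigger Puppet by hand first (assuming the
# standard production run-puppet-agent wrapper) and then restart the units:
root@clouddb1016:~# run-puppet-agent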

Depooling using conftool on cumin hosts

The confctl utility can be used on any of the main cumin hosts (cumin1002.eqiad.wmnet, cumin2002.codfw.wmnet):

taavi@cumin1002 ~ $ sudo confctl select name=clouddb1013.eqiad.wmnet get
{"clouddb1013.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-web,service=s3"}
{"clouddb1013.eqiad.wmnet": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplica-db-web,service=s1"}

taavi@cumin1002 ~ $ sudo confctl select name=clouddb1013.eqiad.wmnet,service=s3 set/pooled=no
eqiad/wikireplica-db-web/s3/clouddb1013.eqiad.wmnet: pooled changed yes => no
WARNING:conftool.announce:conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=s3
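
Running get again should confirm the host is now depooled for that section:

taavi@cumin1002 ~ $ sudo confctl select name=clouddb1013.eqiad.wmnet,service=s3 get
{"clouddb1013.eqiad.wmnet": {"weight": 100, "pooled": "no"}, "tags": "dc=eqiad,cluster=wikireplica-db-web,service=s3"}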

taavi@cumin1002 ~ $ sudo confctl select name=clouddb1013.eqiad.wmnet,service=s3 set/pooled=yes
eqiad/wikireplica-db-web/s3/clouddb1013.eqiad.wmnet: pooled changed no => yes
WARNING:conftool.announce:conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=s3

Depooling using confctl shortcuts on the hosts themselves

confctl shortcuts can also be run on the hosts themselves, but conftool::scripts first needs to be installed on the clouddb hosts (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1038847).
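
Once those scripts are installed, depooling from the host itself should look something like this (a sketch: the standard pool/depool wrappers act on every service pooled on the host, which here means all sections it hosts):

root@clouddb1013:~# depool
# ... do the maintenance ...
root@clouddb1013:~# pool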

Support contacts

If you are following this runbook, you are probably already part of the WMCS or Data Engineering SRE teams. If you need more help, ask whichever of the two teams you are not on.