User:Razzi/2021-06-10
gonna deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/698194
sudo cookbook sre.hadoop.roll-restart-masters analytics
ok I got an eof error somehow...
razzi@cumin1001:~$ sudo cookbook sre.hadoop.roll-restart-masters analytics [0/0]
START - Cookbook sre.hadoop.roll-restart-masters
Checking HDFS and Yarn daemon status. We expect active statuses on the Master node, and standby statuses on the other. Please do not proceed otherwise.
Checking Master/Standby status.
Master status for HDFS:
----- OUTPUT of 'kerberos-run-com...1001-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
active
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00, 1.68s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1001-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Master status for Yarn:
----- OUTPUT of 'kerberos-run-com...1001-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
active
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00, 1.54s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1001-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Standby status for HDFS:
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
standby
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00, 1.64s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Standby status for Yarn:
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
standby
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00, 1.74s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
>>> Please make sure that the active/standby nodes are correct.
Type "go" to proceed or "abort" to interrupt the execution
> go
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: an-master[1001-1002].eqiad.wmnet
----- OUTPUT of 'icinga-downtime ...razzi@cumin1001"' -----
================
PASS |████████████████████████| 100% (1/1) [00:00<00:00, 2.32hosts/s]
FAIL | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'icinga-downtime ...razzi@cumin1001"'.
----- OUTPUT of 'icinga-downtime ...razzi@cumin1001"' -----
================
PASS |████████████████████████| 100% (1/1) [00:00<00:00, 2.82hosts/s]
FAIL | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'icinga-downtime ...razzi@cumin1001"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Restarting Yarn Resourcemanager on Master.
----- OUTPUT of 'systemctl restar...-resourcemanager' -----
================
PASS |████████████████████████| 100% (1/1) [00:11<00:00, 11.71s/hosts]
FAIL | | 0% (0/1) [00:11<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'systemctl restar...-resourcemanager'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Sleeping 60.0 seconds.
Restarting Yarn Resourcemanager on Standby.
----- OUTPUT of 'systemctl restar...-resourcemanager' -----
================
PASS |████████████████████████| 100% (1/1) [00:11<00:00, 11.69s/hosts]
FAIL | | 0% (0/1) [00:11<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'systemctl restar...-resourcemanager'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Checking Master/Standby status.
Master status for Yarn:
----- OUTPUT of 'kerberos-run-com...1001-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
active
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00, 1.75s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1001-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Standby status for Yarn:
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
standby
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00, 1.80s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
>>> Ok to proceed with HDFS Namenodes ?
Type "go" to proceed or "abort" to interrupt the execution
> go
Run manual HDFS failover from master to standby.
Run manual HDFS Namenode failover from an-master1001-eqiad-wmnet to an-master1002-eqiad-wmnet.
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Failover to NameNode at an-master1002.eqiad.wmnet/10.64.21.110:8040 successful
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:17<00:00, 17.95s/hosts]
FAIL | | 0% (0/1) [00:17<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Sleeping 30 seconds.
Restart HDFS Namenode on the master.
----- OUTPUT of 'systemctl restart hadoop-hdfs-zkfc' -----
----- OUTPUT of 'systemctl restar...op-hdfs-namenode' -----
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:29<00:00, 29.38s/hosts]
FAIL | | 0% (0/1) [00:29<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Sleeping 600.0 seconds.
^@Checking Master/Standby status.
Master status for HDFS:
----- OUTPUT of 'kerberos-run-com...1001-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
standby
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00, 1.68s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1001-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Standby status for HDFS:
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
active
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00, 1.65s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
>>> Ok to proceed?
Type "go" to proceed or "abort" to interrupt the execution
> go
Exception raised while executing cookbook sre.hadoop.roll-restart-masters:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
    raw_ret = runner.run()
  File "/usr/lib/python3/dist-packages/spicerack/_module_api.py", line 18, in run
    return self._run(self.args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hadoop/roll-restart-masters.py", line 154, in run
    ask_confirmation("Ok to proceed?")
  File "/usr/lib/python3/dist-packages/wmflib/interactive.py", line 67, in ask_confirmation
    ['go', 'abort'])
  File "/usr/lib/python3/dist-packages/wmflib/interactive.py", line 45, in ask_input
    response = input('> ')
EOFError
END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99)
razzi@cumin1001:~$
razzi@cumin1001:~$
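Note to self on the traceback: Python's built-in input() raises EOFError when it hits end-of-file on stdin before reading any data (a Ctrl-D at the prompt, or a closed or detached terminal, for example). I don't know which of those happened here. Below is a rough sketch of a go/abort prompt that turns EOF into a clearer error. The name ask_go_abort and the read_line parameter are made up for illustration; this is not the wmflib code.

```python
def ask_go_abort(message, read_line=input):
    """Hypothetical go/abort prompt; returns True for "go", False for "abort".

    read_line is injectable so the prompt can be exercised without a terminal.
    """
    print(f">>> {message}")
    print('Type "go" to proceed or "abort" to interrupt the execution')
    while True:
        try:
            response = read_line("> ")
        except EOFError:
            # stdin reached end-of-file, so no answer can ever be read.
            # Fail with an explicit message instead of a bare EOFError.
            raise RuntimeError(
                "stdin closed before a go/abort answer was read"
            ) from None
        if response == "go":
            return True
        if response == "abort":
            return False
        print('Invalid response, please type "go" or "abort"')
```

This only changes how the failure is reported; the run would still stop at that point and the remaining steps would need to be done by hand.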
Ok ran the rest of the commands manually.
See a new error on alerts.wikimedia.org:
CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics
So I pull up journalctl -u monitor_refine_eventlogging_analytics
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: 21/06/10 00:18:23 WARN RefineMonitor: RefineMonitor found problems for path /wmf/data/raw/eventlogging -> database event (/wmf/data/event):
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: The following dataset targets in path /wmf/data/raw/eventlogging between 2021-06-08T00:15:07.000Z and 2021-06-09T20:15:07.001Z either have failed or still need
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: Targets with failures:
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=14
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=15
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=16
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=17
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=18
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=19
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: 21/06/10 00:18:23 INFO RefineMonitor: Sending problem email report to analytics-alerts@wikimedia.org
Jun 10 00:18:24 an-launcher1002 systemd[1]: monitor_refine_eventlogging_analytics.service: Main process exited, code=exited, status=1/FAILURE
Jun 10 00:18:24 an-launcher1002 systemd[1]: monitor_refine_eventlogging_analytics.service: Failed with result 'exit-code'.
Turns out the service just needed to be restarted; the dconf error was unrelated, I guess.
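For next time, a quick sketch for pulling the failed targets out of the journalctl text instead of reading them by eye. It assumes the "Targets with failures" lines keep the shape seen in the journal output above (backquoted database and table, then a path ending in year=/month=/day=/hour= partitions). The function name failed_hours_by_table is made up, and SAMPLE is just two lines copied from the log.

```python
import re
from collections import defaultdict

# Matches e.g. `event`.`wmdebannerevents` /wmf/.../year=2021/month=6/day=9/hour=14
TARGET_RE = re.compile(
    r"`(?P<db>\w+)`\.`(?P<table>\w+)`\s+"
    r"\S+/year=(?P<year>\d+)/month=(?P<month>\d+)"
    r"/day=(?P<day>\d+)/hour=(?P<hour>\d+)"
)


def failed_hours_by_table(journal_text):
    """Group failed (year, month, day, hour) partitions by "db.table"."""
    failures = defaultdict(list)
    for match in TARGET_RE.finditer(journal_text):
        key = f"{match['db']}.{match['table']}"
        partition = tuple(
            int(match[name]) for name in ("year", "month", "day", "hour")
        )
        failures[key].append(partition)
    return dict(failures)


SAMPLE = """\
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: Targets with failures:
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=14
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=15
"""

if __name__ == "__main__":
    for table, partitions in failed_hours_by_table(SAMPLE).items():
        print(table, partitions)
```

Lines that don't match the pattern (the dconf noise, the systemd lines) are simply ignored, so the whole journal dump can be passed in as-is.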