User:Razziabuissa/Plan to drain hadoop cluster

For production, draining the cluster means shutting down input by disabling the camus timers on an-launcher.

With the timers disabled, no data flows in.

Some jobs, such as refine, are scheduled.

The cluster should drain in less than an hour.
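
Rather than relying on the one-hour estimate, the drain could be confirmed by polling YARN until no applications remain. This is a rough sketch, not part of the plan above: `count_apps` and `wait_for_drain` are hypothetical helper names, the 60-second interval is arbitrary, and it assumes application rows in the `yarn application -list` output start with `application_`. On this cluster the `yarn` call would probably need the usual kerberos wrapper.

```shell
# Hypothetical helper: count application rows in `yarn application -list` output read from stdin.
# Assumes each application row starts with an ID of the form application_<timestamp>_<n>.
count_apps() {
  grep -c '^application_' || true   # grep -c prints 0 (and exits non-zero) when nothing matches
}

# Hypothetical helper: block until no RUNNING or ACCEPTED applications remain.
# Caveat: if the yarn command itself fails, the count is 0 and this reports "drained".
wait_for_drain() {
  while :; do
    remaining=$(yarn application -list -appStates RUNNING,ACCEPTED 2>/dev/null | count_apps)
    if [ "$remaining" -eq 0 ]; then
      echo "cluster drained"
      return 0
    fi
    echo "$remaining application(s) still active; checking again in 60s"
    sleep 60
  done
}
```

Running `wait_for_drain` after the timers are stopped would then return once the scheduled jobs have finished.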

Kafka has 7-day retention, so it acts as a buffer while input is stopped.

Now that we have the capacity scheduler, queues can be disabled.
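
For reference, the capacity scheduler stops a queue through the `yarn.scheduler.capacity.<queue-path>.state` property: a `STOPPED` queue accepts no new applications while existing ones run to completion. The sketch below only prints the property for a given queue. `queue_stop_property` is a hypothetical helper and `root.default` is an assumed queue path, not one taken from this cluster's configuration, which is likely managed through puppet rather than edited by hand.

```shell
# Hypothetical helper: print the capacity-scheduler.xml property that stops a queue.
queue_stop_property() {
  queue_path="$1"   # assumed example: root.default
  printf '<property>\n  <name>yarn.scheduler.capacity.%s.state</name>\n  <value>STOPPED</value>\n</property>\n' "$queue_path"
}

queue_stop_property root.default

# After the property is in capacity-scheduler.xml on the ResourceManager host,
# the queue definitions can be reloaded without a restart:
#   yarn rmadmin -refreshQueues
```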

Plan:

disable puppet on an-master1002
- sudo puppet agent --disable 'razzi: upgrade hadoop masters to debian buster'
Disable jobs on an-launcher1002
- sudo systemctl stop 'camus-*'
- sudo systemctl stop 'drop-*'
- sudo systemctl stop 'hdfs-*'
- sudo systemctl stop 'mediawiki-*'
- sudo systemctl stop 'refine_*'
- sudo systemctl stop 'refinery-*'
- sudo systemctl stop 'reportupdater-*'
disable queue
- sudo systemctl stop hadoop-yarn-resourcemanager [1]
kill yarn applications
- for jobId in $(yarn application -list | awk 'NR > 2 { print $1 }'); do yarn application -kill "$jobId"; done
enable safe mode
- sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
checkpoint
- sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
create snapshot tar
- sudo su
- cd /srv/hadoop/namenode
- tar -czf /home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current
copy snapshot to elsewhere
- (from my personal computer)
- scp -3 an-master1001.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(gdate --iso-8601).tar.gz thorium.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(gdate --iso-8601).tar.gz
  - Based on scp-ing a test file, this will take about 30 minutes; that's acceptable, but if there's a faster way (distcp?) it'd be good to know
change uids
reimage
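
Before taking the checkpoint it may be worth confirming that safe mode is actually on. A minimal sketch, assuming `hdfs dfsadmin -safemode get` prints one `Safe mode is ON` or `Safe mode is OFF` line per namenode; `safemode_is_on` is a hypothetical helper name.

```shell
# Hypothetical helper: read `hdfs dfsadmin -safemode get` output from stdin and
# succeed only if at least one line reports ON and no line reports OFF.
safemode_is_on() {
  status=$(cat)
  printf '%s\n' "$status" | grep -q 'Safe mode is ON' || return 1
  ! printf '%s\n' "$status" | grep -q 'Safe mode is OFF'
}

# Possible usage, following the command form used in the plan:
#   sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode get | safemode_is_on \
#     && echo "safe mode confirmed"
```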

stop the cluster

make a backup

change uids

reimage
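
Since the reimage destroys the original namenode directory, it seems prudent to check that the copied backup matches the source before going further. The sketch below compares SHA-256 digests of two local files; `same_checksum` is a hypothetical helper, and for the two hosts in the plan the digests would have to be collected on each host (for example by running `sha256sum` over ssh) and compared by hand.

```shell
# Hypothetical helper: succeed if two local files have identical SHA-256 digests.
same_checksum() {
  a=$(sha256sum "$1" | awk '{ print $1 }')
  b=$(sha256sum "$2" | awk '{ print $1 }')
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Possible usage once both copies are reachable as local paths (paths are placeholders):
#   same_checksum /path/to/original.tar.gz /path/to/copy.tar.gz && echo "backup verified"
```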