User:Razzi/Plan to drain hadoop cluster
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
For production, draining the cluster means shutting down input by disabling the camus timers on an-launcher.
With the timers disabled, no data flows in.
Some jobs, like refine, are scheduled.
The cluster should drain in less than an hour.
Kafka has 7-day retention and is used as a buffer.
Now that we have the capacity scheduler, queues can be disabled.
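As a rough sketch of what disabling a queue could look like: the capacity scheduler reads a per-queue `state` property from capacity-scheduler.xml, and a queue in `STOPPED` state accepts no new applications while running ones finish. The helper name `queue_stop_property` and the queue path `root.default` are placeholders for illustration, not names from this cluster, and on a puppet-managed host the file would normally be changed through puppet rather than by hand.

```shell
# Print the capacity-scheduler.xml property that marks a queue as STOPPED.
# $1 is the queue path; "root.default" below is only a placeholder.
queue_stop_property() {
  cat <<EOF
<property>
  <name>yarn.scheduler.capacity.$1.state</name>
  <value>STOPPED</value>
</property>
EOF
}

queue_stop_property root.default

# Once the property is in capacity-scheduler.xml, the ResourceManager
# re-reads the queue configuration with:
#   yarn rmadmin -refreshQueues
```

Setting the value back to `RUNNING` and refreshing again would reopen the queue.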
Plan:
- disable puppet on an-master1002
- sudo puppet agent --disable 'razzi: upgrade hadoop masters to debian buster'
- disable jobs on an-launcher1002
- sudo systemctl stop 'camus-*'
- sudo systemctl stop 'drop-*'
- sudo systemctl stop 'hdfs-*'
- sudo systemctl stop 'mediawiki-*'
- sudo systemctl stop 'refine_*'
- sudo systemctl stop 'refinery-*'
- sudo systemctl stop 'reportupdater-*'
- disable queue
- sudo systemctl stop hadoop-yarn-resourcemanager
- kill yarn applications
- for jobId in $(yarn application -list | awk 'NR > 2 { print $1 }'); do yarn application -kill "$jobId"; done
- enable safe mode
- sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
- checkpoint
- sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
- create snapshot tar
- sudo su
- cd /srv/hadoop/namenode
- tar -czf /home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current
- copy snapshot to elsewhere
- (from my personal computer)
- scp -3 an-master1001.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(gdate --iso-8601).tar.gz thorium.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(gdate --iso-8601).tar.gz
- Based on scp-ing a test file, this will take about 30 minutes. That is acceptable, but if there is a faster way (distcp?) it would be good to know.
- change uids
- reimage
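The "kill yarn applications" step skips a fixed number of header lines. A slightly more defensive variant, sketched below, selects lines by the application ID pattern instead, so it does not depend on how many header lines `yarn application -list` prints. The function names and the `DRY_RUN` variable are made up for this sketch, and it assumes the caller already has whatever Kerberos credentials the `yarn` CLI needs.

```shell
# Read `yarn application -list` output on stdin and print only the
# application IDs (first column of lines that start with an ID).
extract_app_ids() {
  awk '$1 ~ /^application_[0-9]+_[0-9]+$/ { print $1 }'
}

# Kill every listed application.
# With DRY_RUN=1 the kill commands are printed instead of executed.
kill_all_apps() {
  yarn application -list 2>/dev/null | extract_app_ids | while read -r app_id; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "yarn application -kill $app_id"
    else
      yarn application -kill "$app_id"
    fi
  done
}
```

Running `DRY_RUN=1 kill_all_apps` first shows which applications would be killed before anything is actually stopped.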
stop the cluster
make a backup
change uids
reimage
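Since the backup is the only copy of the namenode state during the reimage, it may be worth confirming that the copied tarball matches the original before continuing. The sketch below compares SHA-256 digests of two local paths; the helper names are invented here, and it assumes GNU coreutils `sha256sum` is available.

```shell
# Print only the hex SHA-256 digest of a file.
file_digest() {
  sha256sum "$1" | awk '{ print $1 }'
}

# Return success only when both files have the same digest.
same_digest() {
  [ "$(file_digest "$1")" = "$(file_digest "$2")" ]
}
```

For the real backup the two copies live on different hosts, so one would run `sha256sum` on each host against its own copy and compare the two hex strings by eye or with a similar string comparison.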