User:Razzi/T280132 disk swap
https://phabricator.wikimedia.org/T280132
First let me see how the host (an-worker1100) is doing
razzi@an-worker1100:~$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
=== RaidStatus completed
Sweet
https://phabricator.wikimedia.org/T280132#7007970
> I did the following:
> - commented the disk in /etc/fstab
> - umounted it manually - sudo umount /var/lib/hadoop/data/k
> - ran puppet to regenerate the list of datadir for yarn and hdfs
> - the yarn nodemanager was down due to this problem, but puppet brought it up again after 3)
- Uncommented disk
- Ran:
sudo mount -a
mount: /var/lib/hadoop/data/k: can't find UUID=7bcd4c25-a157-4023-a346-924d4ccee5a0.
Ok, so I guess the disk has a new uuid
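To confirm which fstab entries point at a uuid that no longer exists, something like the following could be used. This is only a sketch: the function name fstab_uuids_without_device is made up for these notes, and both paths are parameters so it can be tried against copies rather than the live files.

```shell
# Hypothetical helper (not part of any existing tooling).
# Print each UUID referenced by an active (non-comment) fstab line that has
# no matching entry in the by-uuid directory.
fstab_uuids_without_device() {
    fstab="${1:-/etc/fstab}"
    uuid_dir="${2:-/dev/disk/by-uuid}"
    # Keep only lines whose first field is UUID=...; commented lines start
    # with '#', so they never match.
    awk '$1 ~ /^UUID=/ { sub(/^UUID=/, "", $1); print $1 }' "$fstab" |
    while read -r uuid; do
        [ -e "$uuid_dir/$uuid" ] || printf '%s\n' "$uuid"
    done
}
```

Run with no arguments it reads /etc/fstab and /dev/disk/by-uuid; on this host it would presumably print 7bcd4c25-a157-4023-a346-924d4ccee5a0.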
ls -l /dev/disk/by-uuid/
Hmm that shows only /dev/sdX, but I don't know which disk it is.
There's also this link: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Swapping_broken_disk
# From the previous commands you should be able to fill in the variables
# with the values of the disk's properties indicated below:
# X => Enclosure Device ID
# Y => Slot Number
# Z => Controller (Adapter) number
megacli -PDMakeGood -PhysDrv[X:Y] -aZ
So now I'm on step 6, where I want to do something like:
> Add the single disk RAID0 array (use the details from the steps above):
sudo megacli -CfgLdAdd -r0 [32:0] -a0
Given I have:
Adapter #0
...
Enclosure Device ID: 32
Slot Number: 11
Firmware state: Online, Spun Up
I will run
sudo megacli -CfgLdAdd -r0 [32:11] -a0
Ok I ran this but got:
Exit Code: 0x1a
razzi@an-worker1100:~$ echo $?
26
Ok I see on this webpage: https://www.thomas-krenn.com/de/wiki/MegaCLI_Error_Messages
0x1a Maximum LDs are already configured
So maybe it's already configured. Let me try to proceed.
Well, I want to figure out which /dev/sd? it is. I don't know how to identify it directly, but the new disk's uuid won't show up in /etc/fstab.
for u in $(ls /dev/disk/by-uuid/); do echo "$u"; grep "$u" /etc/fstab; done
bash loop did the trick... e97258d2-5661-469a-9d34-56bd84a80714 is the one.
But wait, there's also 91c728b2-0dc9-4755-841c-ecdab46d38ae...
a7ab9126-4ef4-4824-a41c-69b4f8630edb
Hmm these are all dm-0, 1 2... not what I want. Maybe the disk isn't showing up yet
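The loop above prints every uuid followed by any matching fstab line, so the unmatched ones have to be spotted by eye. A variant that prints only the uuids missing from fstab could look like this. It's a sketch: uuids_missing_from_fstab is a name invented for these notes, and it matches against the whole file, commented lines included.

```shell
# Hypothetical helper (not part of any existing tooling).
# Print every entry in a by-uuid directory that is not mentioned anywhere
# in an fstab file. Both paths default to the real system locations.
uuids_missing_from_fstab() {
    uuid_dir="${1:-/dev/disk/by-uuid}"
    fstab="${2:-/etc/fstab}"
    for path in "$uuid_dir"/*; do
        # Skip the literal glob pattern if the directory is empty.
        [ -e "$path" ] || [ -L "$path" ] || continue
        uuid="$(basename "$path")"
        # -F: fixed-string match, -q: quiet, exit status only.
        if ! grep -qF -- "$uuid" "$fstab"; then
            printf '%s\n' "$uuid"
        fi
    done
}
```

Adding ls -l on each hit would show which device the uuid links to, which is how the dm-N entries would have stood out right away.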
ls /dev/sd? | wc gives 23.
Yeah I think the disk isn't showing up. I'll comment on the task
Ok, I copied the wrong part of the megacli output earlier: slot 11 is already Online. The entry I need is:
Enclosure Device ID: 32
Slot Number: 10
Firmware state: Unconfigured(good), Spun Up
That's the right disk.
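Picking the wrong entry out of a long drive listing is easy, so a filter that prints only the drives in Unconfigured(good) state might help. This is a sketch that assumes the listing has the three field names shown above, one per line, in that order. The function name unconfigured_good_drives is made up for these notes.

```shell
# Hypothetical helper (not part of any existing tooling).
# Read a megacli physical-drive listing on stdin and print "[enclosure:slot]"
# for every drive whose firmware state starts with "Unconfigured(good)".
unconfigured_good_drives() {
    awk -F': *' '
        /^Enclosure Device ID:/ { enc = $2 }
        /^Slot Number:/         { slot = $2 }
        /^Firmware state:/ && $2 ~ /^Unconfigured\(good\)/ {
            printf "[%s:%s]\n", enc, slot
        }
    '
}
```

Assuming the listing comes from sudo megacli -PDList -aALL, piping that into unconfigured_good_drives should print the value to pass to -CfgLdAdd, which here would be [32:10].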
razzi@an-worker1100:~$ sudo megacli -CfgLdAdd -r0 [32:10] -a0
Adapter 0: Created VD 10
Adapter 0: Configured the Adapter!!
Now it shows sdl as unused in lsblk:
sdk      8:160  0  1.8T  0 disk
└─sdk1   8:161  0  1.8T  0 part /var/lib/hadoop/data/m
sdl      8:176  0  1.8T  0 disk
sdm      8:192  0  1.8T  0 disk
└─sdm1   8:193  0  1.8T  0 part /var/lib/hadoop/data/q
Now I want the disk uuid, but it's not showing in blkid or /dev/disk/by-uuid/...
Oh right, it doesn't have a partition or filesystem yet, and the uuid belongs to the filesystem on the partition.
sudo parted /dev/sdl --script mklabel gpt
sudo parted /dev/sdl --script mkpart primary ext4 0% 100%
sudo mkfs.ext4 -L hadoop-k /dev/sdl1
sudo tune2fs -m 0 /dev/sdl1
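Since these four commands are destructive and differ only in the device and label, it may be worth generating them and reading them over before running anything. A minimal sketch follows. The function name print_disk_prep_commands is invented here, and it assumes sdX-style naming where the first partition is the device name plus 1.

```shell
# Hypothetical helper (not part of any existing tooling).
# Print (do not run) the partition and format commands for one data disk.
# $1 = whole-disk device, e.g. /dev/sdl
# $2 = filesystem label, e.g. hadoop-k
print_disk_prep_commands() {
    dev="$1"
    label="$2"
    part="${dev}1"   # assumption: partition 1 is named <dev>1
    printf 'parted %s --script mklabel gpt\n' "$dev"
    printf 'parted %s --script mkpart primary ext4 0%% 100%%\n' "$dev"
    printf 'mkfs.ext4 -L %s %s\n' "$label" "$part"
    printf 'tune2fs -m 0 %s\n' "$part"
}
```

Calling print_disk_prep_commands /dev/sdl hadoop-k prints the same four commands as above, minus sudo.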
Now lsblk shows its uuid: cb58c727-dec9-4abf-8b21-3d70a6443b6d
But it's not showing its space in lsblk...
sdl
└─sdl1 ext4 hadoop-k cb58c727-dec9-4abf-8b21-3d70a6443b6d
sdm
└─sdm1 ext4 hadoop-q 766882b0-078f-4bc3-b118-3f8456446b52 380.2G 79% /var/lib/hadoop/data/q
It looks like it does mount, but doesn't stay.
Yep
Apr 29 15:38:09 an-worker1100 kernel: [1810702.143355] EXT4-fs (sdl1): mounted filesystem with ordered data mode. Opts: (null)
Apr 29 15:38:09 an-worker1100 systemd[1]: var-lib-hadoop-data-k.mount: Unit is bound to inactive unit dev-disk-by\x2duuid-7bcd4c25\x2da157\x2d4023\x2da346\x2d924d4ccee5a0.device. Stopping, too.
Apr 29 15:38:09 an-worker1100 systemd[1]: Unmounting /var/lib/hadoop/data/k...
Apr 29 15:38:09 an-worker1100 systemd[1]: var-lib-hadoop-data-k.mount: Succeeded.
systemctl daemon-reexec
might fix it, since systemd still has the mount unit bound to the old uuid's device. It did!