SRE/Dc-operations/Platform-specific documentation/Dell Documentation
- Lights Out Manager: Dell iDRAC8 (13th generation servers), Dell iDRAC9 (14th and 15th generation servers, such as the R440 and R450)
- We always purchase the enterprise version and license, allowing for a dedicated network port, rather than using the ports on the primary Ethernet interfaces.
Lights Out Management
iDRAC 8 and 9 always start in racadm mode, so commands are entered at the prompt without the racadm prefix:
racadm>>
Common Actions
Show logs
getsel
lclog view
See also the main SEL page.
Reboot and boot from network then console
set iDRAC.ServerBoot.FirstBootDevice PXE
serveraction powercycle
console com2
Note: If you are having trouble with PXE boots, try Ctrl+S to enter the card-level setup for Broadcom (if applicable), and set the boot protocol to PXE instead of NONE inside the card settings. This worked for some new 10G cards in R620s (BCM578xx).
Reboot and boot into BIOS then console
set iDRAC.ServerBoot.FirstBootDevice BIOS
serveraction powercycle
console com2
Connecting to mgmt interface
- Via SSH (the management password is in pwstore)
- ssh root@$HOSTNAME.mgmt.$DATACENTER.wmnet
- Example: ssh root@bast1001.mgmt.eqiad.wmnet
- Via Browser
- Sometimes useful in troubleshooting, but you must first set up an HTTP(S) proxy into the mgmt network, as demonstrated at Proxy_access_to_cluster
- https://$HOSTNAME.mgmt.$DATACENTER.wmnet
- Please note that you will have to override an unknown (self-signed) certificate. Do not save the exception permanently: having several of these saved tends to cause errors when connecting to other Dell DRAC interfaces via HTTPS.
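The naming pattern above can be sketched in a few lines of Python. This is only an illustration of the $HOSTNAME.mgmt.$DATACENTER.wmnet convention; the function names are hypothetical and not part of any existing tooling.

```python
def mgmt_fqdn(hostname, datacenter):
    """Build the management FQDN: $HOSTNAME.mgmt.$DATACENTER.wmnet."""
    if "." in hostname:
        raise ValueError("expected a short hostname, not an FQDN")
    return f"{hostname}.mgmt.{datacenter}.wmnet"


def ssh_argv(hostname, datacenter, user="root"):
    """Argument list for an SSH session to the management interface."""
    return ["ssh", f"{user}@{mgmt_fqdn(hostname, datacenter)}"]


def https_url(hostname, datacenter):
    """URL of the HTTPS management interface (reachable only via a proxy)."""
    return f"https://{mgmt_fqdn(hostname, datacenter)}"


print(" ".join(ssh_argv("bast1001", "eqiad")))
```

The printed command matches the SSH example given in the list above.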
Connecting to Serial Console
- Attach to the serial console: console com2
- Detach from serial console: ctrl+\
- Console Redirection Key Mappings:
Use the <ESC><1> key sequence for <F1>
Use the <ESC><2> key sequence for <F2>
Use the <ESC><3> key sequence for <F3>
Use the <ESC><0> key sequence for <F10>
Use the <ESC><!> key sequence for <F11>
Use the <ESC><@> key sequence for <F12>
Use the <ESC><Ctrl><M> key sequence for <Ctrl><M>
Use the <ESC><Ctrl><H> key sequence for <Ctrl><H>
Use the <ESC><Ctrl><I> key sequence for <Ctrl><I>
Use the <ESC><Ctrl><J> key sequence for <Ctrl><J>
Use the <ESC><X><X> key sequence for <Alt><x>, where x is any letter key, and X is the upper case of that key
Use the <ESC><R><ESC><r><ESC><R> key sequence for <Ctrl><Alt><Del>
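For scripted console sessions, the key mappings above can be written as a small lookup. The following is a minimal sketch, assuming the sequences are sent as raw bytes with ESC as 0x1b; the helper names are hypothetical.

```python
ESC = "\x1b"

# Second character of the <ESC><n> sequence for each function key listed above.
FUNCTION_KEYS = {"F1": "1", "F2": "2", "F3": "3", "F10": "0", "F11": "!", "F12": "@"}


def function_key_sequence(key):
    """Return the console redirection sequence for a function key, e.g. F2."""
    return ESC + FUNCTION_KEYS[key.upper()]


def alt_key_sequence(letter):
    """Return <ESC><X><X> for <Alt><x>, where X is the upper case of the letter."""
    if len(letter) != 1 or not letter.isalpha():
        raise ValueError("expected a single letter")
    upper = letter.upper()
    return ESC + upper + upper


# <Ctrl><Alt><Del> is sent as <ESC><R><ESC><r><ESC><R>.
CTRL_ALT_DEL = ESC + "R" + ESC + "r" + ESC + "R"
```

For example, function_key_sequence("F2") gives the two characters needed to enter system setup during POST over the serial console.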
If you get locked out of the console, you can reset the iDRAC:
racreset
This can happen if your network connection dies while you are logged into a console session.
Changing BIOS Boot Order
Sometimes the PXE boot order is wrong. The chassis only tries to boot off PXE once; if the first PXE option fails, it immediately boots off the hard drive. For a successful PXE boot, you must set the correct NIC as the first PXE option.
Get current boot order (tested on iDRAC 9 only)
get BIOS.BiosBootSettings.BootSeq
(example response)
[Key=BIOS.Setup.1-1#BiosBootSettings]
BootSeq=HardDisk.List.1-1,NIC.Embedded.1-1-1,NIC.Slot.2-1-1
Change boot order
set BIOS.BiosBootSettings.BootSeq HardDisk.List.1-1,NIC.Slot.2-1-1,NIC.Embedded.1-1-1
Stage job for next reboot using "jobqueue" command
jobqueue create BIOS.Setup.1-1
(example response)
RAC1024: Successfully scheduled a job.
Verify the job status using "racadm jobqueue view -i JID_xxxxx" command.
Commit JID = JID_576799115492
At this point, you can reboot the server using the below commands.
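As a rough illustration of the reordering step, the sketch below parses a BootSeq response and moves one NIC ahead of the other NICs while leaving non-NIC entries where they are. The function names are hypothetical, and the response format is assumed to match the example shown above.

```python
def parse_boot_seq(response):
    """Extract the device list from a 'get BIOS.BiosBootSettings.BootSeq' response."""
    for line in response.splitlines():
        line = line.strip()
        if line.startswith("BootSeq="):
            return line[len("BootSeq="):].split(",")
    raise ValueError("no BootSeq line found")


def prefer_nic(boot_seq, nic):
    """Return a new boot sequence in which `nic` is the first PXE (NIC) option."""
    if not nic.startswith("NIC.") or nic not in boot_seq:
        raise ValueError(f"{nic} is not a NIC in the current boot sequence")
    nic_positions = [i for i, dev in enumerate(boot_seq) if dev.startswith("NIC.")]
    other_nics = [dev for dev in boot_seq if dev.startswith("NIC.") and dev != nic]
    reordered = list(boot_seq)
    # Refill the NIC slots in place so that hard disk entries keep their position.
    for position, dev in zip(nic_positions, [nic] + other_nics):
        reordered[position] = dev
    return reordered


def set_boot_seq_command(boot_seq):
    """Build the racadm 'set' command for the given boot sequence."""
    return "set BIOS.BiosBootSettings.BootSeq " + ",".join(boot_seq)


RESPONSE = """[Key=BIOS.Setup.1-1#BiosBootSettings]
BootSeq=HardDisk.List.1-1,NIC.Embedded.1-1-1,NIC.Slot.2-1-1"""

print(set_boot_seq_command(prefer_nic(parse_boot_seq(RESPONSE), "NIC.Slot.2-1-1")))
```

With the example response above, this prints the same set command shown under "Change boot order".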
Power cycling
Log in with SSH on the mgmt interface:
racadm serveraction action (for older iDRAC)
serveraction action (for iDRAC 8/9)
- Where action is one of the following:
- powerdown - power server off
- powerup - power server on
- powercycle - perform server power cycle
- hardreset - force hard server power reset
- powerstatus - display current power status of server
- Alternatively, use the SM CLP shell. After logging in, use the following commands:
reset /system1
stop /system1
start /system1
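A minimal sketch of how a script might build the power command for either shell style. The helper name and its racadm_shell parameter are hypothetical; only the action names come from the list above.

```python
POWER_ACTIONS = ("powerdown", "powerup", "powercycle", "hardreset", "powerstatus")


def serveraction_command(action, racadm_shell=True):
    """Build the power command string.

    racadm_shell=True targets iDRAC 8/9, which are always in racadm mode;
    racadm_shell=False adds the 'racadm' prefix needed on older iDRACs.
    """
    if action not in POWER_ACTIONS:
        raise ValueError(f"unknown action {action!r}; expected one of {POWER_ACTIONS}")
    command = f"serveraction {action}"
    return command if racadm_shell else f"racadm {command}"


print(serveraction_command("powerstatus"))
print(serveraction_command("powercycle", racadm_shell=False))
```

Rejecting unknown actions up front avoids sending a mistyped command to a production host's management interface.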
Administrative Actions
Updating Firmware
Dell firmware can be updated via the iDRAC command line (this requires an FTP server, and we're still working on this setup) or via the HTTPS mgmt interface. When updating via the HTTPS interface, you can queue up multiple firmware revisions for BIOS, NIC, RAID, and/or iDRAC together; it will apply all the non-iDRAC updates first and then the iDRAC firmware updates.
Dell firmware can be downloaded directly from Dell, without login, via the 'Dell Config' link under every server's netbox details page. Click Dell Config, then 'Drivers and Downloads' on the Dell Support Site.
<todo: list standard file format naming and firmware names required>
It's also possible to use the sre.hardware.upgrade-firmware cookbook to upgrade the BIOS, iDRAC and NIC. By default the cookbook will attempt to upgrade the BIOS and iDRAC to the most recent version.
$ sudo cookbook sre.hardware.upgrade-firmware --help
usage: cookbook [-h] [--no-reboot] [--disable-cached-answers] [--yes] [--firmware-store FIRMWARE_STORE] [-f] [-n]
[-c {bios,idrac,nic,storage}]
query
Audit and possibly update firmware.
Usage example:
cookbook sre.hosts.firmware 'example1001*'
positional arguments:
query Cumin query to match the host(s) to act upon.
optional arguments:
-h, --help show this help message and exit
--no-reboot don't perform any reboots. Updates will not be installed unless the user preforms a manual
reboot (default: False)
--disable-cached-answers
By default this cookbook caches the answers for firmware selection. Add this to disable
the behaviour (default: False)
--yes, -y Don't prompt for confirmations (default: False)
--firmware-store FIRMWARE_STORE, -S FIRMWARE_STORE
The location where firmware is stored (default: None)
-f, --force force the upgrade even if the firmware allready matches (default: False)
-n, --new The server is a new server and as such not in puppetdb. (default: False)
-c {bios,idrac,nic,storage}, --component {bios,idrac,nic,storage}
force a specific type of upgrade: bios, idrac, nic, storage (default: None)
Commonly updated firmwares: idrac, bios, network, raid, backplane.
Urgent Firmware Revision Notices:
- Broadcom NetXtreme-E firmware for the BCM57412 10G NIC should only be upgraded to 21.85.21.92, as 22.00.07.60 breaks the installer. If you don't want to roll back the firmware, you can use the --force-dhcp-tftp flag when running the reimage cookbook, such as:
sudo cookbook sre.hosts.reimage --force-dhcp-tftp --new --os bullseye example1001 -t T1234
- Broadcom NetXtreme-E firmware for the BCM57414 10/25G NIC must run 21.60.22.11; newer versions cause PXE boot issues, including "Failed to load ldlinux.c32", when an SFP+ (10G) module is used. To avoid rolling back the firmware, you can use the --force-dhcp-tftp flag when running the reimage cookbook, such as:
sudo cookbook sre.hosts.reimage --force-dhcp-tftp --new --os bullseye example1001 -t T1234
- iDRAC9 firmware for PowerEdge servers should stop just shy of 6.0, as 6.0 and above have issues with the HTTPS interface.
- iDRAC8 firmware should not be upgraded beyond version 2.80.80.80 for the same reason. See this page for an explanation of the new HTTPS header checks.
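The iDRAC ceilings in the notices above could be checked before an upgrade with a dotted-version comparison. This is a sketch only: the function names are hypothetical, the limits simply encode the two iDRAC notices in this list, and the version strings in the usage lines are illustrative.

```python
def parse_version(version):
    """Turn a dotted firmware version such as '2.80.80.80' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def idrac_upgrade_allowed(generation, target_version):
    """Check a target iDRAC firmware version against the notices above.

    generation is 8 or 9. iDRAC9 must stay below 6.0; iDRAC8 must not go
    beyond 2.80.80.80.
    """
    target = parse_version(target_version)
    if generation == 9:
        return target < (6,)
    if generation == 8:
        return target <= (2, 80, 80, 80)
    raise ValueError("only iDRAC 8 and 9 are covered by these notices")


print(idrac_upgrade_allowed(9, "5.10.50.00"))
print(idrac_upgrade_allowed(8, "2.81.81.81"))
```

Comparing tuples of integers avoids the pitfalls of comparing version strings lexically, where "10" would sort before "9".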
Rolling back Firmware updates
The rollback feature is a tab in the HTTPS interface next to the update feature tab. If the HTTPS interface is unavailable (say by updating idrac/8 past version 2.51), then a crash cart can be connected and the system rebooted into the Lifecycle Controller (key entry required during post via crash cart) and then iDRAC Settings > Update and Rollback firmware > Rollback firmware, where you can select the version/increment to rollback.
NICs
You can identify which type of NIC the server has from the web interface under System -> Inventory -> Hardware inventory (there are multiple pages).
The Broadcom 10G NICs will be something like "NIC in Slot N Port Y - PCI Device" and the Description field will contain the exact model, typically NetXtreme-E. Alternatively if you have OS access to the host, lshw will display the same information.
With the NIC model you can download the driver from the Netbox shortlink; the web interface will accept firmware in Windows (.exe) format.
Firmware Device Ordering
At WMF, we typically install the OS on /dev/sda. However, new chassis that have a PERC 11 card will prioritize devices connected to that card. Consider a scenario where you have a SATA SSD that you'd like to use as the OS disk, and a hardware RAID virtual disk you'd like to use to store data. With the PERC 11, your SATA SSD will appear as /dev/sdb or later.
If you'd like to avoid this (and avoid writing a new partman recipe), you can enable Firmware Device Ordering:
- At boot time, press F2 to enter system setup
- From system setup, choose "Device Settings," then choose your PERC device.
- Choose "View Server Profile" (yes, they hid write actions behind a "view" menu :( )
- Choose "Advanced Controller Properties"
- Set "Firmware Device Order" and set the order.
- Reboot
See the linked article for more details.
HTTPS Mgmt Interface
- This requires you have an SSH tunnel into our mgmt network via a cumin host.
- Pull up and login (user root, and mgmt password) to the https mgmt interface. Example: https://mw1446.mgmt.eqiad.wmnet
- Maintenance > System Update
- Please note when updating firmware, you can upload multiple files and apply them all in a single batch action, rather than individually uploading and applying. Simply upload the files, and once all are uploaded check all boxes and click 'Install and Reboot'.
- The order of operations seems to apply Bios first, then network (and raid/backplane, not tested for order), then idrac last when batched together.
Polling for MAC Address
- SSH into iDRAC interface.
- Info command:
nicstatistics
- Ensure you pick out the proper MAC address for the correct interface (careful of 1G and 10G numbering):
racadm>>nicstatistics
NIC.Embedded.1-1-1:Broadcom Gigabit Ethernet BCM5720 - 2C:EA:7F:7F:C0:B2
PartitionCapable : Not Capable
NIC.Embedded.2-1-1:Broadcom Gigabit Ethernet BCM5720 - 2C:EA:7F:7F:C0:B3
PartitionCapable : Not Capable
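To avoid picking the wrong MAC by eye, the nicstatistics output can be parsed into a mapping from interface name to MAC address. The sketch below assumes the output format shown in the example above; the function name is hypothetical.

```python
import re

# Matches e.g. "NIC.Embedded.1-1-1:Broadcom Gigabit Ethernet BCM5720 - 2C:EA:7F:7F:C0:B2"
NIC_ENTRY = re.compile(
    r"(NIC\.[\w.-]+):.*? - ((?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2})"
)


def parse_nicstatistics(output):
    """Return {interface name: MAC address} from nicstatistics output."""
    return {fqdd: mac.upper() for fqdd, mac in NIC_ENTRY.findall(output)}


SAMPLE = """racadm>>nicstatistics
NIC.Embedded.1-1-1:Broadcom Gigabit Ethernet BCM5720 - 2C:EA:7F:7F:C0:B2
PartitionCapable : Not Capable
NIC.Embedded.2-1-1:Broadcom Gigabit Ethernet BCM5720 - 2C:EA:7F:7F:C0:B3
PartitionCapable : Not Capable
"""

for name, mac in parse_nicstatistics(SAMPLE).items():
    print(name, mac)
```

Keying the result by the full interface name keeps the 1G embedded ports and any 10G slot ports clearly separated.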
Changing iDRAC User Password
- SSH into iDRAC interface.
- Change command:
racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 <newpassword>
iDRAC SSH Key Based Authentication
- This is not yet standard on all hosts. RobH is working on it, and merely listing the commands here for posterity.
- On Dell systems, the root user is userid # 2. (#1 is given to the disabled anon access user.)
- We only assign a single ssh key per user, though up to 4 can be assigned.
- SSH into iDRAC interface.
- List all keys assigned to a user
- Specific key:
racadm sshpkauth -i <2 to 16> -v -k <1 to 4>
- All keys:
racadm sshpkauth -i <2 to 16> -v -k all
- List root user keys:
racadm sshpkauth -i 2 -v -k all
- Add key:
racadm sshpkauth -i <2 to 16> -k <1 to 4> -t <key-text>
- Add key to slot one for root user:
racadm sshpkauth -i 2 -k 1 -t "contents of the public key file line"
- Delete key:
- Specific Key:
racadm sshpkauth -i <2 to 16> -d -k <1 to 4>
- All Keys:
racadm sshpkauth -i <2 to 16> -d -k all
- Delete all root user keys:
racadm sshpkauth -i 2 -d -k all
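The sshpkauth variants above differ only in a few flags, so a script could assemble them with range checks on the user index and key slot. This is a hedged sketch; sshpkauth_args is a hypothetical helper, and it only builds the argument list without running anything.

```python
def sshpkauth_args(action, user_index=2, key_slot="all", key_text=None):
    """Build the racadm sshpkauth argument list.

    action is 'view', 'add' or 'delete'. user_index must be 2 to 16 (root is 2)
    and key_slot must be 1 to 4, or 'all' for view and delete.
    """
    if not 2 <= user_index <= 16:
        raise ValueError("user index must be between 2 and 16")
    if key_slot != "all" and key_slot not in (1, 2, 3, 4):
        raise ValueError("key slot must be 1 to 4, or 'all'")
    base = ["racadm", "sshpkauth", "-i", str(user_index)]
    if action == "view":
        return base + ["-v", "-k", str(key_slot)]
    if action == "delete":
        return base + ["-d", "-k", str(key_slot)]
    if action == "add":
        if key_slot == "all" or not key_text:
            raise ValueError("adding a key needs a specific slot and the key text")
        return base + ["-k", str(key_slot), "-t", key_text]
    raise ValueError(f"unknown action {action!r}")


print(sshpkauth_args("view"))
```

Returning a list rather than a single string sidesteps shell quoting problems with the public key text.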
Changing the iDRAC Network IP Settings
racadm setniccfg -s <ipaddress> <subnetmask> <gateway>
Enable / Disable IPMI over DRAC
racadm config -g cfgIpmiLan -o cfgIpmiLanEnable <0 or 1>
- 0 is off, 1 is on.
Setting a one-time boot option
Sometimes you want the server to reboot into a network boot, or into bios directly. You can set one-time boot options with the following on the mgmt SSH command line:
racadm config -g cfgServerInfo -o cfgServerBootOnce 1
racadm config -g cfgServerInfo -o cfgServerFirstBootDevice <BOOT OPTION>
- Valid boot option targets: No-Override, PXE, HDD, DIAG, CD-DVD, BIOS, vFDD, VCD-DVD, iSCSI, VFLASH partition label, FDD, SDe, RFS (Remote File Share)
- We most commonly just make use of PXE & BIOS
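The two-command sequence above can be generated with a check against the listed boot targets. A minimal sketch, with a hypothetical helper name; vFlash partition labels are site-specific values, so they are deliberately left out of the fixed list here.

```python
# Fixed boot targets from the list above (vFlash partition labels are not included).
FIXED_BOOT_TARGETS = (
    "No-Override", "PXE", "HDD", "DIAG", "CD-DVD", "BIOS",
    "vFDD", "VCD-DVD", "iSCSI", "FDD", "SDe", "RFS",
)


def one_time_boot_commands(target):
    """Return the two racadm commands that set a one-time boot target."""
    if target not in FIXED_BOOT_TARGETS:
        raise ValueError(f"unsupported boot target {target!r}")
    return [
        "racadm config -g cfgServerInfo -o cfgServerBootOnce 1",
        f"racadm config -g cfgServerInfo -o cfgServerFirstBootDevice {target}",
    ]


for command in one_time_boot_commands("PXE"):
    print(command)
```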
Changing the User Defined String for the front LCD
- Please note only models with LCD have this set: R620, R720
- Checking the string:
racadm get System.LCD.LCDUserString
- Setting new string:
racadm set System.LCD.LCDUserString newstring
- We need to test this on a new system out of box. Rather than set the display User specified string and string in DRAC setup, skip that, and attempt to set with this single command afterwards. Ideally populating this via the iDrac command line will force it to also be set to display on LCD (rather than default service tag.) Please let RobH know the result of this test (or just put it in here!) --RobH (talk)
Troubleshooting
- 10Gb NIC systems will occasionally need to have the legacy PXE boot option specifically set in the NIC bios. If you have it hitting the PXE boot step, and simply halting, this setting is not correct.
- For bullseye installs, if DHCP fails or there are issues with the partitioning, a firmware upgrade for the iDRAC and NIC is required. As of February 2023, for the cp hosts on bullseye, we observed that iDRAC: 6.10.00.00 and NIC: 21.85.21.92 (not 22.x) work the best.
- bullseye installer issues? Please note that the firmware cookbook doesn't work with iDRAC versions lower than 3.30. If the host iDRAC firmware is 3.15, you will need to manually update to 3.30 via the HTTP management interface and then run the cookbook for the rest of the upgrade. Please see https://phabricator.wikimedia.org/T321309#8581111 for more information.
- If attempts to connect to the web UI are failing with status 500 (this has been observed after firmware upgrades), try:
racadm set IDRAC.WebServer.HostHeaderCheck 0
Initial System Setup
Automatic setup
The initial setup for Dell servers can be done by running the sre.hosts.provision cookbook. The only prerequisite is that the server is racked and connected to both power and network cables. No additional steps are needed, and there is no need to attach a physical console to the host.
For a new host, run the cookbook from within a tmux/screen session with the hostname (not the FQDN) as it is defined in Netbox, for example:
sudo cookbook sre.hosts.provision example1001
See the cookbook's help (with -h or --help) for more details on all the steps performed. Its current output (Dec. 2021) is:
$ sudo cookbook sre.hosts.provision -h
usage: cookbook [-h] [--no-dhcp] [--no-users] [--enable-virtualization] host
Provision a new physical host setting up it's BIOS, management console and NICs.
Actions performed:
* Validate that the host is a physical host and the vendor is supported (only Dell at this time)
* Fail if the host is active on Netbox but --no-dhcp and --no-users are not set as a precautionary measure
* [unless --no-dhcp is set] Setup the temporary DHCP so that the management console can get a connection and
become reachable
* Get the current configuration for BIOS, management console and NICs
* Modify the common settings
* [if --enable-virtualization is set] Leave virtualization enabled, by default it gets disabled
* Push back the whole modified configuration
* Checks that it can still connect to Redfish API
* Checks that the configuration has been applied correctly dumping the new configuration and trying to apply
the same changes. In case it detects any non-applied configuration will prompt the user what to do. It can
retry to apply them, or the user can apply them manually (via web console or ssh) and then skip the step.
* [unless --no-users is set] Update the root's user password with the production management password
* Checks that it can connect via remote IPMI
Usage:
cookbook sre.hosts.provision example1001
cookbook sre.hosts.provision --enable-virtualization example1001
cookbook sre.hosts.provision --no-dhcp --no-users example1001
positional arguments:
host Short hostname of the host to provision, not FQDN
optional arguments:
-h, --help show this help message and exit
--no-dhcp Skips the DHCP setting, assuming that the management console is already reachable (default: False)
--no-users Skips changing the root's user password from Dell's default value to the management one. Uses the management passwords also for the first connection (default:
False)
--enable-virtualization
Keep virtualization capabilities on. They are turned off if not speficied. (default: False)
Troubleshooting
Failed to perform GET request to https://$HOSTNAME.mgmt.$DC.wmnet/redfish
If the cookbook fails early on in its run with Failed to perform GET request to https://$HOSTNAME.mgmt.$DC.wmnet/redfish, the most likely reason is that the iDRAC was unable to get an IP address via DHCP. This might happen for various reasons, one of which is a typo in the device's Service Tag in Netbox. To troubleshoot the issue, follow these steps:
- On the cumin host from which the cookbook was run, find the DHCP snippet that was sent to the install server for the provisioned host. It can be extracted from the cookbook log with the following command, where $MY_HOSTNAME is the hostname passed to the cookbook:
eval $(sudo grep '/usr/bin/base64' /var/log/spicerack/sre/hosts/provision.log | grep $MY_HOSTNAME | tail -n1 | grep -o "/bin/echo.*base64 -d")
- On the install server, run tcpdump for a minute or so to check which DHCP requests are arriving and what their Hostname Option is:
sudo tcpdump -vvv 'udp and (src port 67 or src port 68 or src port 69)' | grep 'Hostname '
- Check whether the Service Tag from the DHCP setting is present in the tcpdump output. If not, it's possible that the Service Tag in Netbox does not match that of the device. Correct it in Netbox before retrying the cookbook.
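The final comparison step can be sketched as a case-insensitive search of the captured Hostname lines. The helper name, the sample lines, and the Service Tag values below are all hypothetical; the exact tcpdump line layout is an assumption, so the check only relies on the 'Hostname ' marker that the grep above already uses.

```python
def service_tag_seen(tcpdump_lines, service_tag):
    """Return True if any 'Hostname ' line mentions the given Service Tag."""
    wanted = service_tag.strip().lower()
    if not wanted:
        raise ValueError("service_tag must not be empty")
    return any(
        wanted in line.lower()
        for line in tcpdump_lines
        if "Hostname " in line
    )


# Hypothetical captured lines, for illustration only.
CAPTURED = [
    'Hostname Option 12, length 13: "idrac-ABC1234"',
    'Hostname Option 12, length 13: "idrac-DEF5678"',
]

print(service_tag_seen(CAPTURED, "abc1234"))
```

If the tag from the DHCP snippet never shows up in the capture, the Netbox entry is the first thing to compare against the physical device.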
Unable to connect to the Redfish API
If you get this error early on in the provision cookbook, it's possible that the credentials have already been changed on the host's iDRAC but the --no-users CLI flag of the cookbook was not set. To confirm this, check the cookbook's logs and see if there is a message like the following one in your run:
spicerack.redfish.RedfishError: GET https://xx.xx.xx.xx/redfish returned HTTP 401 with message:
{
"error": {
"code": "Base.1.8.GeneralError",
"message": "A general error has occurred. See ExtendedInfo for more information.",
"@Message.ExtendedInfo": [
{
"@odata.type": "#Message.v1_1_0.Message",
"MessageId": "Base.1.8.AccessDenied",
"Message": "The authentication credentials included with this request are missing or invalid.",
"MessageArgs": [],
"MessageArgs@odata.count": 0,
"RelatedProperties":[],
"RelatedProperties@odata.count": 0,
"Severity": "Critical",
"Resolution": "Attempt to ensure that the URI is correct and that the service has the appropriate credentials."
}
]
}
}
If that's the case, pass the --no-users flag to the cookbook and retry.
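When scanning logs for this condition, the JSON body of the error can be checked for an AccessDenied entry. The sketch below assumes the message has the shape shown above (some text followed by a JSON object); the function name is hypothetical and the sample is a trimmed copy of that log message.

```python
import json


def is_access_denied(log_message):
    """Return True if the Redfish error body contains an AccessDenied entry."""
    start = log_message.find("{")
    if start == -1:
        return False
    try:
        payload = json.loads(log_message[start:])
    except json.JSONDecodeError:
        return False
    extended = payload.get("error", {}).get("@Message.ExtendedInfo", [])
    return any(
        item.get("MessageId", "").endswith("AccessDenied") for item in extended
    )


SAMPLE = """spicerack.redfish.RedfishError: GET https://xx.xx.xx.xx/redfish returned HTTP 401 with message:
{"error": {"code": "Base.1.8.GeneralError",
 "@Message.ExtendedInfo": [{"MessageId": "Base.1.8.AccessDenied", "Severity": "Critical"}]}}"""

print(is_access_denied(SAMPLE))
```

Matching on the MessageId suffix rather than the full string keeps the check working if the Base registry version prefix differs between firmware releases, which is an assumption on my part rather than documented behaviour.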
Manual steps
We need to change a number of options in the BIOS and mgmt configuration to our own specifications. Some of these items MUST be done locally by the on-site technician racking the systems. As such, all of the following steps should be done by the on-site technician before a new server is considered complete in the racked state.
- Rack server according to directions on how to do so (including in racking task)
- Attach physical console to system (keyboard & monitor).
- Boot system, and enter BIOS by pressing F2 during POST.
- System POSTS and enters BIOS with screen listing: System Bios, iDRAC Settings, & Device Settings.
- Please note that on the first boot, this will be a GUI that can be driven with the keyboard only. After serial redirection is set up, this menu will no longer display as a GUI on the physical console, but as a non-graphical, command-line style menu system.
- Enter System Bios.
- We will be changing the entries for a number of items. If an item is not listed, it doesn't need to change from defaults.
- Processor Settings
- Logical Processor set to enable - This is hyperthreading, and sometimes we don't need it. If unsure, ask some application-specific expert. HHVM and Elasticsearch greatly benefit from it.
- Virtualization Technology set to disabled - leaving this on when not using system for virtual machines leaves a potential security vector.
- Serial Communication
- Serial Communication set to: On with console redirection via COM2
- Serial Port Address set to Serial Device1=COM1,Serial Device2=COM2
- External Serial Connector: Serial Device 1
- Failsafe Baud Rate: 115200
- Remote Terminal Type: VT100/VT200
- Redirection after boot set to disabled - newer ubuntu versions prefer this during boot, though you may swap it back when troubleshooting PXE boot issues.
- System Profile Settings
- System Profile set to Performance Per Watt (OS) - the default dell setting causes power_saving/watchdog kernel threads to spawn and inordinately consume CPU cycles, this setting fixes it.
- Miscellaneous Settings
- Asset Tag set to the Asset Tag assigned when system was received into stock.
- Hit ESC to exit out, when prompted saving your changes. Do not exit the BIOS screen entirely, just go back to main settings screen (System Bios, iDRAC Settings, & Device Settings)
- Select iDRAC Settings
- Network
- Confirm the following settings:
- Enable NIC is set to Enabled - should already be set to this, just a double-check
- Nic Selection is set to Dedicated (iDRAC7 Enterprise only) - should be set to this already, if it won't change, it means the DRAC Enterprise License did not apply correctly during purchase. Please contact RobH or cmjohnson if this occurs so we can get it fixed.
- Set the Static IP, Static Gateway, & Static Subnet Mask.
- We don't assign DNS servers to mgmt interfaces.
- Enable IPMI Over LAN set to enabled - this will allow us to use IPMI commands & scripting in the future.
- Front Panel Security (Only for systems with front LCD)
- Set LCD message set to User-Defined String
- User-defined string set to system name if available, otherwise input asset tag again.
- User Configuration
- Set password to the mgmt password
- Press ESC and save when prompted until you have backed out of the BIOS entirely; all settings are now in place.
- All systems must be tested for DRAC and console redirection before the racking and on-site work is complete.
- Connect to DRAC via SSH
- Test powercycling, powering down, and powering up.
- Test console redirection, ensure you can watch system POST via SSH mgmt session.
- Once testing has passed, system is ready for operations allocation.
Perc H750 Raid Controllers (Virtual Drive ID, Boot Order)
Please note that the order in which you create virtual disks on the 15th generation PERC H750 controller is the opposite of past controllers. The first one created will have the highest device ID (and thus be the furthest from sda in the OS view). You want to create the data disk virtual drive array first, then the OS disk virtual drive array.
Don't forget to set the boot virtual disk array in the raid controller configuration, then exit, reboot back into the bios after a powercycle, and set the boot order in the bios to drive C first, then NIC. (The main bios couldn't see the new array for the OS until after you set it up and rebooted.)
If virtual disk arrays are added after OS installation, the OS should still boot correctly as grub will be installed and config will point to UUID versus just 'sdx' for the boot of OS data.