SRE/Dc-operations/Platform-specific documentation/ServerTech
ServerTech CDUs are smart power supplies with support for power and environmental monitoring and, on some models, outlet control.
Issues
- After ANY IP/Network/SNMP/syslog changes, the settings must be saved and the PDU/CDU restarted. Restarting the PDU/CDU does NOT interrupt power delivery, and all powered gear should remain unaffected.
Initial Setup
- Connect the PDU to its serial connection, then connect and log in to the PDU via serial
- default user/pass is admn/admn
- immediately run the following to set the ip info so you can connect via HTTPS to configure the remainder (example is for PDUs in eqiad, you'll need to change for your site's network info):
set dhcp disabled
set ipv4 address <ip address>
set ipv4 gateway 10.65.0.1
set ipv4 subnet 255.255.0.0
set dns primary 10.3.0.1
set dns secondary 10.3.0.1
reboot
- Confirm the reboot. You should now be able to connect to the device via https://ps1-rack-site.mgmt.site.wmnet to set up the remainder
- Connect via HTTPS and log in with the default admn/admn; we'll change this first.
- Go to Configuration > Access > Local Users >
- Create Users and create the root user, using the mgmt-pdu password.
- Edit the root user, and change Access Level to Administrator
- Log out, and log back in as root
- Go to Configuration > Access > Local Users >
- Delete the default admn user.
- Go to Configuration > Network > DHCP/IP
- Configure DHCP settings to set the FQDN to the pdu FQDN
- Un-check Zero Touch Provisioning
- Apply Settings (don't reboot yet, we'll do that when we are done)
- Go to Configuration > Network > HTTP/HTTPS
- Un-check Enable under HTTP server (because unsecured traffic is a bad idea)
- Apply Settings (don't reboot yet, we'll do that when we are done)
- Go to Configuration > Network > SNMP
- Set the SNMPv2 Agent GET and SET passwords (you should copy this from another PDU)
- Set the system name (FQDN)
- Set the system location (site)
- Set system contact: Wikimedia <noc@wikimedia.org>
- Apply Settings (don't reboot yet, we'll do that when we are done)
- Go to Configuration > Network > SNTP
- Set the primary and secondary hosts to ntp.eqiad.wikimedia.org and ntp.codfw.wikimedia.org, using whichever site is closest to the PDU as the primary
- Apply Settings (don't reboot yet, we'll do that when we are done)
- Go to Configuration > Network > Syslog
- Set both Host 1 and Host 2 to syslog.anycast.wmnet
- Apply Settings
- Now reboot the PDU so that it applies the settings above: go to Tools > Restart > Action: Restart and Apply.
- The PDU will restart and apply all the updated settings. It should then start showing up in LibreNMS.
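The serial-console commands from the first steps above differ between sites only in their addresses, so they can be generated rather than retyped. The following is a minimal Python sketch, not an existing tool: the function name `bootstrap_commands` is hypothetical, and the address 10.65.0.99 is a placeholder rather than a real assignment. It validates the values and checks that the gateway sits inside the PDU's subnet before printing anything to paste into the serial session.

```python
import ipaddress


def bootstrap_commands(address, gateway, netmask, dns_primary, dns_secondary):
    """Return the serial-console bootstrap commands, in the order they are run."""
    # Reject malformed values before they are typed into the PDU.
    for value in (address, gateway, netmask, dns_primary, dns_secondary):
        ipaddress.IPv4Address(value)
    # The gateway must be reachable from the PDU's own subnet.
    network = ipaddress.IPv4Network(f"{address}/{netmask}", strict=False)
    if ipaddress.IPv4Address(gateway) not in network:
        raise ValueError(f"gateway {gateway} is outside {network}")
    return [
        "set dhcp disabled",
        f"set ipv4 address {address}",
        f"set ipv4 gateway {gateway}",
        f"set ipv4 subnet {netmask}",
        f"set dns primary {dns_primary}",
        f"set dns secondary {dns_secondary}",
        "reboot",
    ]


# Placeholder address; the other values are the eqiad examples from this page.
for line in bootstrap_commands(
    "10.65.0.99", "10.65.0.1", "255.255.0.0", "10.3.0.1", "10.3.0.1"
):
    print(line)
```

For another site, only the arguments change; the command names stay as listed in the steps above.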
Adding items to scs
- Using a browser, go to the scs, for example scs-a1-sdtpa.mgmt.pmtpa.wmnet
- add the pdu to the appropriate port
- connect the pdu to the scs switch (be sure to use cisco wire cfg)
- From a terminal, log in to root@scs-a1-sdtpa.mgmt and connect to the port using these commands:
- #pmshell
- #“port #”
- Run the following commands, being sure to set the IP address to the appropriate value:
- set dhcp disabled
- set ipaddress “10.1.5.21”
- set subnet 255.255.0.0
- set gateway “10.1.0.1”
- restart
After the pdu restarts
- Using a web browser, go to the web page and accept the security certificate (you can use the IP address in the address bar)
- Login to the pdu with the default user/passwd admn/admn
- Go to users and create user "root" and give it the management password. Then click apply
- Under the action link use edit and select admin as the role and click apply.
- Logout of the pdu and then login again as user: root with the mgmt password
- go to users and delete the admn user.
Adding to Netbox
- Ensure the IP address for the PDU has the box checked for "Make this the primary IP for the device/VM", or it will fail when added to LibreNMS.
Setting up the Configuration
Configuration:
System:
About:
Location: ulsfo
Bluetooth:
Enable unchecked
Network:
DHCP/IP:
Primary DNS: 10.3.0.1
IPv4 Address/mask/gateway: SET
FQDN: SET
DHCP: Uncheck
FTP:
FTP Server: Uncheck*
HTTP/HTTPS:
HTTP Server: Uncheck*
HTTPS server: Verify Checked
SNMP:
SNMPv2 Agent: Enable
GET Community: Set to SNMP secret*
SET Community: Empty
System Name: Set to hostname
System Location: Set to site code (ulsfo, etc.)
System Contact: noc@wikimedia.org
SNTP:
Primary host: ntp.eqiad.wikimedia.org
Secondary host: ntp.codfw.wikimedia.org
Syslog:
Host 1: syslog.anycast.wmnet
Telnet/SSH:
Telnet server: Uncheck*
SSH server: verify checked
Access:
Local Users:
admn: remove
Tools:
restart:
Action: Restart
* Restart required
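The checklist above can also be expressed as data and compared against what is actually set on a PDU. The sketch below is illustrative only: the key names are hypothetical, and how the observed values are collected (for example, by reading them off the web UI) is left open. Site-specific values such as the IP address, FQDN and SNMP secret are deliberately not included.

```python
# Expected values taken from the checklist above. The key names are
# hypothetical and do not correspond to any ServerTech API.
EXPECTED = {
    "bluetooth_enabled": False,
    "dhcp_enabled": False,
    "ftp_server_enabled": False,
    "http_server_enabled": False,
    "https_server_enabled": True,
    "snmpv2_agent_enabled": True,
    "snmp_set_community": "",
    "sntp_primary_host": "ntp.eqiad.wikimedia.org",
    "sntp_secondary_host": "ntp.codfw.wikimedia.org",
    "syslog_host_1": "syslog.anycast.wmnet",
    "telnet_server_enabled": False,
    "ssh_server_enabled": True,
}


def audit(observed):
    """Return {setting: (expected, observed)} for every setting that differs."""
    problems = {}
    for key, expected in EXPECTED.items():
        actual = observed.get(key)  # a missing setting shows up as None
        if actual != expected:
            problems[key] = (expected, actual)
    return problems


# Example: a PDU where Telnet was left enabled.
observed = dict(EXPECTED, telnet_server_enabled=True)
for setting, (expected, actual) in audit(observed).items():
    print(f"{setting}: expected {expected!r}, found {actual!r}")
```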
Adding devices to monitoring
Add device to LibreNMS: LibreNMS#Add a device to LibreNMS
Add device to Icinga: duplicate this change
Cookbooks
There are a number of cookbooks to aid in the management of this hardware. Run cookbook -lv sre.pdus from a Cumin host to see a list of the current cookbooks.
$ sudo cookbook -lv sre.pdus
cookbooks
`-- sre: SRE Cookbooks
    `-- sre.pdus: -
        |-- sre.pdus.reboot-and-wait: List PDU 🔌 uptime
        |-- sre.pdus.rotate-password: Update Sentry PDUs 🔌 passwords
        |-- sre.pdus.rotate-snmp: Update Sentry PDUs 🔌 SNMP communities
        `-- sre.pdus.uptime: List PDU 🔌 uptime
PDU's
The PDU cookbooks live under sre.pdus (list them with cookbook -l sre.pdus), and they all have the following common arguments:
- --username USERNAME: the username to use to login to the PDU
- --check-default: if this flag is passed the script will check if the default username and password is still configured
- query: either the word all, or a list of PDU IPs or Netbox device names. The cookbook will run against all PDUs which have a valid entry in Netbox
sre.pdus.uptime
This cookbook simply reports the PDU uptime (in unix uptime format) and the PDU version. It has no additional arguments; run it as follows:
$ sudo cookbook sre.pdus.uptime all
START - Cookbook sre.pdus.uptime
Current password:
10.65.0.55: uptime 6 days 16 hours 6 minutes 8 seconds
10.193.0.34: uptime 6 days 21 hours 18 minutes 10 seconds
10.65.0.48: uptime 4 days 18 hours 0 minutes 27 seconds
10.65.0.39: uptime 6 days 17 hours 36 minutes 45 seconds
END (PASS) - Cookbook sre.pdus.uptime (exit_code=0)
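When the cookbook is run against many PDUs, it can help to turn the output into something sortable. The following is a minimal Python sketch that parses lines in the format shown above into `timedelta` values. The function name `parse_uptimes` is hypothetical, and accepting singular units such as "1 day" is an assumption, since the sample output only shows plural forms.

```python
import re
from datetime import timedelta

# Matches lines such as "10.65.0.55: uptime 6 days 16 hours 6 minutes 8 seconds".
UPTIME_RE = re.compile(
    r"^(?P<host>\S+): uptime (?P<days>\d+) days? (?P<hours>\d+) hours? "
    r"(?P<minutes>\d+) minutes? (?P<seconds>\d+) seconds?$"
)


def parse_uptimes(output):
    """Map each PDU in the cookbook output to its uptime; other lines are skipped."""
    uptimes = {}
    for line in output.splitlines():
        match = UPTIME_RE.match(line.strip())
        if match is None:
            continue  # START/END banners, password prompts, and so on
        uptimes[match["host"]] = timedelta(
            days=int(match["days"]),
            hours=int(match["hours"]),
            minutes=int(match["minutes"]),
            seconds=int(match["seconds"]),
        )
    return uptimes


sample = """\
START - Cookbook sre.pdus.uptime
10.65.0.55: uptime 6 days 16 hours 6 minutes 8 seconds
10.65.0.48: uptime 4 days 18 hours 0 minutes 27 seconds
END (PASS) - Cookbook sre.pdus.uptime (exit_code=0)
"""
uptimes = parse_uptimes(sample)
# List the PDUs starting with the most recently rebooted one.
for host in sorted(uptimes, key=uptimes.get):
    print(host, uptimes[host])
```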
sre.pdus.reboot-and-wait
This cookbook can be used to perform a rolling reboot of a set of PDUs. It reboots the PDUs in order and waits for each PDU to fully reboot before moving on.
This cookbook has the following additional arguments:
- --since SINCE: by default the cookbook will try to reboot all nodes. Pass an integer number of seconds as SINCE to make the cookbook reboot only the nodes which have not been rebooted since that value
$ sudo cookbook sre.pdus.reboot-and-wait ps1-a8-codfw
START - Cookbook sre.pdus.reboot-and-wait
Current password:
10.193.0.32: rebooting Sentry v3 PDU
10.193.0.32: sleep while reboot
Failed to call 'cookbooks.sre.pdus.wait_reboot_since' [1/25, retrying in 5.00s]:
Failed to call 'cookbooks.sre.pdus.wait_reboot_since' [2/25, retrying in 5.00s]:
Failed to call 'cookbooks.sre.pdus.wait_reboot_since' [3/25, retrying in 5.00s]:
Failed to call 'cookbooks.sre.pdus.wait_reboot_since' [4/25, retrying in 5.00s]:
10.193.0.32: found reboot since 2020-09-22 09:38:10.464170
END (PASS) - Cookbook sre.pdus.reboot-and-wait (exit_code=0)
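The log above shows the general pattern: skip PDUs that were rebooted recently, and after each reboot poll with a fixed delay until the PDU is back. The sketch below illustrates that pattern in Python. It is not the cookbook's source; `needs_reboot` and `wait_for` are hypothetical names, and treating SINCE as a maximum uptime in seconds is an assumption about how the option is interpreted. The defaults of 25 tries and a 5 second delay mirror the "[1/25, retrying in 5.00s]" lines in the log.

```python
import time
from datetime import timedelta


def needs_reboot(uptime, since_seconds=None):
    """Decide whether a PDU should be rebooted.

    Assumption: SINCE is read as "rebooted within the last N seconds",
    so a PDU whose uptime is longer than that still needs a reboot.
    """
    if since_seconds is None:
        return True  # default behaviour: reboot everything
    return uptime > timedelta(seconds=since_seconds)


def wait_for(check, tries=25, delay=5.0, sleep=time.sleep):
    """Call check() until it returns True; return the attempt that succeeded."""
    for attempt in range(1, tries + 1):
        if check():
            return attempt
        if attempt < tries:
            sleep(delay)  # fixed delay between attempts, no backoff
    raise TimeoutError(f"condition not met after {tries} attempts")


# Example with a fake check that succeeds on the third call and a fake
# sleep, so nothing actually waits.
answers = iter([False, False, True])
print(wait_for(lambda: next(answers), sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the polling loop testable without real delays.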
sre.pdus.rotate-snmp
This cookbook is used to update all SNMP read-only strings to a specific value and, optionally, to update all SNMP read/write strings to a random value.
This cookbook has the following additional arguments:
- --force: if passed, force an update of the SNMP strings and a reboot of the PDU even if the configured RO SNMP string is already correct
- --reset-rw: update the RW snmp string to a random value
$ sudo cookbook sre.pdus.rotate-snmp ps1-a8-codfw
START - Cookbook sre.pdus.rotate-snmp
Enter login password:
New SNMP RO String:
Again, just to be sure:
10.193.0.32: Updating SNMP RO
10.193.0.32: SNMP RO: updated
10.193.0.32: rebooting Sentry v3 PDU
10.193.0.32: sleep while reboot
Failed to call 'cookbooks.sre.pdus.wait_reboot_since' [1/25, retrying in 5.00s]:
Failed to call 'cookbooks.sre.pdus.wait_reboot_since' [2/25, retrying in 5.00s]:
Failed to call 'cookbooks.sre.pdus.wait_reboot_since' [3/25, retrying in 5.00s]:
Failed to call 'cookbooks.sre.pdus.wait_reboot_since' [4/25, retrying in 5.00s]:
10.193.0.32: found reboot since 2020-09-22 09:10:50.040743
END (PASS) - Cookbook sre.pdus.rotate-snmp (exit_code=0)
$ sudo cookbook sre.pdus.rotate-snmp ps1-a8-codfw
START - Cookbook sre.pdus.rotate-snmp
Enter login password:
New SNMP RO String:
Again, just to be sure:
10.193.0.32: SNMP communities already match (version: 3, uptime: 0 days 0 hours 5 minutes 45 seconds)
END (PASS) - Cookbook sre.pdus.rotate-snmp (exit_code=0)
sre.pdus.rotate-password
This cookbook is used to rotate a user password
$ sudo cookbook sre.pdus.rotate-password 10.193.0.32
START - Cookbook sre.pdus.rotate-password
Current password:
New password:
Again, just to be sure:
10.193.0.32: Password updated successfully 😌
END (PASS) - Cookbook sre.pdus.rotate-password (exit_code=0)
Other Info
powering off and on outlets
These instructions are good for Sentry Switched CDU 6.0 firmware. (Manuals at servertech's site, see the manuals for Firmware Version 6.0). They may well work for other versions of the firmware but haven't been tested there.
Please Note: We have a very limited number of switched powerstrips in place. Only the network racks (A1-sdtpa, A1-eqiad, A8-eqiad) and B1-sdtpa have these. The rest have normal non-switched powerstrips. The reasoning behind this is that a switched strip has more parts and complexity, and so can develop issues more easily than the simpler strips. Networking kit does not normally have full out-of-band lights-out management, so those racks have switched strips. B1-sdtpa is a legacy rack housing older servers that do not have full lights-out remote reboot capabilities.
- Find the pdu for your rack and your server. See Netbox
- It will have a name like ps1-b1-sdtpa.mgmt.pmtpa.wmnet (powerstrip-n, rack m, dc name...)
- ssh in, using the standard credentials for the management network.
- Check what outlets or groups of outlets you want. You can list those by running list user
- At the "Username:" prompt, type root. The outlets are listed first, the groups at the bottom.
- To power the outlet off, do
off name-of-outlet-or-group-here
- Example: off .BC6 or off dataset1-all
- To power the outlet on, do
on name-of-outlet-or-group-here
Sample successful output from a power off command:
Switched CDU: off dataset1-all

   Group: dataset1-all

      Outlet  Outlet                   Outlet  Load    Power    Control
      ID      Name                     Status  (Amps)  (Watts)  State

      .AC6    dataset1_a:xz:6          Off     0.00    0        Off
      .AC7    dataset1-array1_z:xz:7   Off     0.00    0        Off
      .BC6    dataset1_b:xz:6          Off     0.00    0        Off
      .BC7    dataset1-array1_b:xz:7   Off     0.00    0        Off

   Command successful

Switched CDU:
Power on gives similar results.
Note that there's a five minute timeout so if you idle too long you'll have to reconnect.
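To confirm that every outlet in a group really changed state, the command output can be checked programmatically instead of by eye. The following Python sketch parses the outlet rows from output like the sample above. The names `Outlet` and `parse_outlets` are hypothetical, and the row pattern (an outlet ID such as .AC6, a name without spaces, and On/Off states) is inferred from that single sample, so other firmware versions or transitional states may need a looser pattern.

```python
import re
from typing import NamedTuple


class Outlet(NamedTuple):
    outlet_id: str
    name: str
    status: str
    load_amps: float
    power_watts: int
    control_state: str


# One outlet row: ID, name, status, load, power, control state.
# Assumption: IDs look like ".AC6" and states are only "On" or "Off".
ROW_RE = re.compile(
    r"(\.[A-Z]{2}\d+)\s+(\S+)\s+(On|Off)\s+(\d+\.\d+)\s+(\d+)\s+(On|Off)"
)


def parse_outlets(output):
    """Return one Outlet per row found in the CLI output."""
    return [
        Outlet(m[1], m[2], m[3], float(m[4]), int(m[5]), m[6])
        for m in ROW_RE.finditer(output)
    ]


sample = (
    "Switched CDU: off dataset1-all Group: dataset1-all "
    ".AC6 dataset1_a:xz:6 Off 0.00 0 Off "
    ".BC6 dataset1_b:xz:6 Off 0.00 0 Off "
    "Command successful"
)
outlets = parse_outlets(sample)
# True only if every outlet in the group reports Off.
print(all(o.status == "Off" for o in outlets))
```

Because the pattern matches on whitespace rather than on line breaks, it works whether the output is captured as a table or as a single wrapped line.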
summary of other commands
A few other useful commands:
- From the command line, hitting return/enter will show you a list of commands it knows.
- Check the version of the firmware with show system. It also announces the firmware version when you log in.
- Some commands for displaying various statuses: show ports/network/options/towers/infeeds/traps/system
- A couple monitoring commands: istat, envmon, sysstat
- Listing user info: list user (you will be prompted for a specific user), list users (you will be shown a list of all users)