SEL
The system event log (SEL) is used to log hardware events directly to the BMC. This allows events to be logged even if the OS is unable to do so. Currently we use ipmiseld. We also export the ipmi_sel_logs_count metric so we can detect when new events have been logged; however, at this point a manual investigation is required to decide what action to take.
Additional information
To get additional information, use either the ipmi-sel command on the server or the racadm command getsel. The latter is more likely to give precise information; however, the former often gives enough information to progress the resolution.
ipmi-sel
ipmi-sel comes from the FreeIPMI tools and tries to interpret the messages based on standardised codes. Running the command without any options will give you some information, e.g.
$ sudo ipmi-sel
ID | Date | Time | Name | Type | Event
67 | Feb-14-2023 | 22:45:19 | SBE Log Disabled | Event Logging Disabled | Correctable Memory Error Logging Disabled ; OEM Event Data2 code = C0h ; OEM Event Data3 code = 40h
However, you can get ipmi-sel to try to infer additional information by adding the --interpret-oem-data and --entity-sensor-names parameters, e.g.
$ sudo ipmi-sel --interpret-oem-data --entity-sensor-names
ID | Date | Time | Name | Type | Event
67 | Feb-14-2023 | 22:45:19 | System Firmware SBE Log Disabled | Event Logging Disabled | Correctable Memory Error Logging Disabled ; DIMM A7
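When the log holds many entries, it can help to reduce the table to the columns needed for a ticket. The following is a minimal sketch, assuming the pipe-separated layout shown above; sel_summary is a hypothetical helper name, not part of FreeIPMI.

```shell
#!/bin/sh
# Reduce the pipe-separated table printed by ipmi-sel (read on stdin) to
# "ID<TAB>Type<TAB>Event" per entry. sel_summary is a hypothetical helper.
sel_summary() {
    awk -F'|' '
        NR == 1 && $1 ~ /ID/ { next }   # skip the header row
        NF >= 6 {
            # trim the padding around each column
            for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            printf "%s\t%s\t%s\n", $1, $5, $6
        }
    '
}

# Usage on a host (needs root and a reachable BMC):
#   sudo ipmi-sel --interpret-oem-data --entity-sensor-names | sel_summary
```

Reading from stdin keeps the helper usable on saved output as well as on a live host.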
racadm
Using racadm is likely to give you more accurate, human-readable information as it uses the Dell interface; however, this requires logging in to the iDRAC interface.
$ ssh cloudvirt1036.mgmt.eqiad.wmnet
root@cloudvirt1036.mgmt.eqiad.wmnet's password:
racadm>>getsel
Record: 67
Date/Time: 02/14/2023 22:45:19
Source: system
Severity: Critical
Description: Correctable memory error logging disabled for a memory device at location DIMM_A7.
-------------------------------------------------------------------------------
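For longer logs, the multi-line getsel records can be collapsed to one line each. This is a rough sketch, assuming the record layout shown above; getsel_oneline is a hypothetical helper name. It is also an assumption that the iDRAC accepts a racadm command passed directly on the ssh command line.

```shell
#!/bin/sh
# Collapse "racadm getsel" records (read on stdin) to one line each:
# "record<TAB>severity<TAB>description". getsel_oneline is a hypothetical helper.
getsel_oneline() {
    awk '
        { sub(/\r$/, "") }               # drop carriage returns from ssh output
        $1 == "Record:"      { rec = $2 }
        $1 == "Severity:"    { sev = $2 }
        $1 == "Description:" {
            sub(/^[ \t]*Description:[ \t]*/, "")
            printf "%s\t%s\t%s\n", rec, sev, $0
        }
    '
}

# Usage (assumes the iDRAC runs a command passed over ssh):
#   ssh root@cloudvirt1036.mgmt.eqiad.wmnet racadm getsel | getsel_oneline
```

The Source and Date/Time fields are dropped here for brevity; further `$1 ==` rules could capture them in the same way.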
Alerts
When you receive an alert, check the SEL using one of the aforementioned methods, create a Phabricator ticket for each actionable event, then clear the log down as described under Clearing SEL.
Memory Errors
See Memory correctable errors -EDAC-
Clearing SEL
Once all events have been logged, you should clear the SEL down. One way to do this is with the sudo ipmi-sel --clear command.
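Because clearing is irreversible, it may be worth keeping a copy of the log first. This sketch saves the SEL to a file and clears it only if the save succeeded. archive_and_clear_sel and the SEL_CMD override are hypothetical names introduced here so the function can be exercised with a stand-in command; they are not an existing tool.

```shell
#!/bin/sh
# Save the current SEL to a file, then clear it only if the save succeeded.
# SEL_CMD is a hypothetical override for dry runs; it defaults to "sudo ipmi-sel".
archive_and_clear_sel() {
    out_file=$1
    sel_cmd=${SEL_CMD:-sudo ipmi-sel}
    # Unquoted on purpose: sel_cmd may hold a command plus arguments.
    $sel_cmd > "$out_file" || return 1
    # Refuse to clear if nothing was written to the archive file.
    [ -s "$out_file" ] || return 1
    $sel_cmd --clear
}

# Usage on a host:
#   archive_and_clear_sel "/tmp/sel-$(hostname)-$(date +%F).txt"
```

The archive path in the usage line is only an example; any location that survives until the tickets are filed would do.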