Monitoring/check dsh groups
This Icinga alert checks if a MediaWiki appserver is a member of the mediawiki-installation
DSH group.
The DSH groups control which hosts Scap deploys code to. In the Puppet repository, the list of servers comes from Hiera, in ./hieradata/common/scap/dsh.yaml, which refers to conftool:
    mediawiki-installation:
      conftool:
        - {'cluster': 'appserver', 'service': 'apache2'}
        - {'cluster': 'api_appserver', 'service': 'apache2'}
        - {'cluster': 'jobrunner', 'service': 'apache2'}
        - {'cluster': 'testserver', 'service': 'apache2'}
The conftool data is also in the Puppet repository, under ./conftool-data/node. Check whether the affected host name appears there. If it does not, you can add it, but first make sure there is no known hardware issue with the server by searching Phabricator for its host name.
If the host is listed but the alert still fires, first run scap pull to fetch the latest code, then run pool to add the host to the pool. The Icinga alert should recover a little while later.
Alternatively, you can pool the server from a management host such as cumin1001 using conftool commands.
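The lookup described above (does the host appear in the conftool data under a cluster and service that dsh.yaml selects?) can be sketched in Python. This is only an illustration, not the check the Icinga plugin actually runs. The selectors are copied from the dsh.yaml excerpt. The NODES dictionary, its layout (datacenter, then cluster, then host, then list of services), and the host names are assumptions made up for the example, and the function names are hypothetical.

```python
# Selectors copied from the mediawiki-installation entry in dsh.yaml.
SELECTORS = [
    {"cluster": "appserver", "service": "apache2"},
    {"cluster": "api_appserver", "service": "apache2"},
    {"cluster": "jobrunner", "service": "apache2"},
    {"cluster": "testserver", "service": "apache2"},
]

# Hypothetical stand-in for parsed conftool-data/node content.
# Assumed layout: datacenter -> cluster -> host -> list of services.
NODES = {
    "eqiad": {
        "appserver": {"mw1001.example": ["apache2", "nginx"]},
        "jobrunner": {"mw1002.example": ["apache2"]},
        "parsoid": {"wtp1001.example": ["parsoid"]},
    },
}


def matching_selectors(host, nodes=NODES, selectors=SELECTORS):
    """Return every selector whose cluster and service match the host."""
    matches = []
    for clusters in nodes.values():
        for cluster, hosts in clusters.items():
            services = hosts.get(host, [])
            for selector in selectors:
                if selector["cluster"] == cluster and selector["service"] in services:
                    matches.append(selector)
    return matches


def is_listed(host, nodes=NODES, selectors=SELECTORS):
    """True if the host would be selected for the DSH group by these selectors."""
    return bool(matching_selectors(host, nodes, selectors))


if __name__ == "__main__":
    for name in ("mw1001.example", "wtp1001.example", "mw9999.example"):
        print(name, "listed" if is_listed(name) else "not listed")
```

A host that is missing entirely and a host that exists only under an unselected cluster both come back as not listed. In both cases the remedy in this runbook is the same: add or correct the entry in the conftool data.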
Inactive servers
There are two levels of depooling. A host set to pooled=no (which translates to enabled: False in pybal) receives no public traffic but still receives Scap deployments. A host set to pooled=inactive is additionally removed from the DSH group. The latter is generally used only when a host is unable to receive code updates, because it avoids Scap deployment errors caused by unreachable servers.
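The effect of each pooled value can be summarized as a small lookup table. The Python sketch below restates the paragraph above and is not conftool or pybal code. The PoolEffect type and the function name are hypothetical, and the "yes" row is inferred from the description of the two depooled states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolEffect:
    receives_traffic: bool  # host serves public traffic
    in_dsh_group: bool      # host receives Scap deployments


# Mapping from the conftool "pooled" value to its effect, as described above.
EFFECTS = {
    "yes": PoolEffect(receives_traffic=True, in_dsh_group=True),    # inferred
    "no": PoolEffect(receives_traffic=False, in_dsh_group=True),
    "inactive": PoolEffect(receives_traffic=False, in_dsh_group=False),
}


def missing_from_dsh_group(pooled):
    """True if a host in this state is absent from the DSH group."""
    return not EFFECTS[pooled].in_dsh_group


if __name__ == "__main__":
    for state, effect in EFFECTS.items():
        print(f"pooled={state}: traffic={effect.receives_traffic}, "
              f"scap={effect.in_dsh_group}")
```

Read this way, moving a host from pooled=inactive to pooled=no restores Scap deployments without sending it public traffic.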
If a server has come back online from maintenance or downtime and starts triggering the "Host not in mediawiki-installation dsh group" alert, run scap pull to ensure it does not serve outdated code to monitoring requests in production (T310225), then set pooled=no so that it receives Scap deployments going forward.
Once the maintenance or repair ticket has been updated or resolved and any other verification has taken place, the server can be repooled.
History
Historically, before we had Salt or Cumin, we used DSH to run commands on multiple servers at once.
Server groups were text files in the Puppet repository and mediawiki-installation
was one of them. Taking a server out of the pool meant editing this text file. Pooling servers is now managed via conftool/confctl. See also pool/depool app servers.