Wikimedia Cloud Services team/EnhancementProposals/WMCS-SRE-book
More information and discussion about changes to this draft on the talk page.
The WMCS SRE book is a collection of agreements, best practices, and engineering operations workflows that we all in the team try to follow and apply in our day-to-day work.
Here you can find hints and tips for common tasks and situations we face. It is not intended to contain very technical information (like how to restart service X).
Community interaction/communication
This section describes our practices for interacting with the community for some special operations, including but not limited to:
- deprecation plans (we are shutting down a service or are no longer supporting a given technology)
- expected downtime (we have to do a planned operation in a window)
- outage/incident communication (we suffered an unexpected issue that affected our users/community)
Communication channels
We interact with our users using different mechanisms:
- Using Phabricator tasks and a number of Phabricator tags/workboards.
- On IRC, in the #wikimedia-cloud channel on libera.chat.
- Via mailing lists: cloud-announce@lists.wikimedia.org, cloud@lists.wikimedia.org, and cloud-admin@lists.wikimedia.org.
- On wiki: News articles on wikitech, talk page messages on wikitech, talk page messages on "home" wikis
- Direct email
Due to the nature of each channel (real-time, asynchronous, group, direct, etc.), they are often used for different purposes.
TODO: introduce here general rules for picking a channel to send a particular type of communication
Deprecation plans
TODO: add information and examples here. Context: Grid migration, Ubuntu deprecation in CloudVPS/Toolforge, Jessie deprecation, etc.
Expected downtime
It is common that we need to perform operations that introduce downtime for our services or otherwise negatively affect our users' experience. This section contains details on how to handle these situations.
General rules:
- If the operation is going to cause downtime to users (any amount), announce it to the mailing lists at least 1 week ahead of the window.
- When communicating operation windows, be precise about which kinds of downtime users should expect (all services down? network failure? database failure?)
- When communicating operation windows, include concrete information on what services are affected (does this affect Cloud VPS but not Toolforge? or the other way around?)
- When communicating operation windows, be very explicit about times and dates. Use the international date format (YYYY-MM-DD) and the UTC timezone for times (see the sketch after the timeline below).
Recommended timeline:
- 1 week before the operation window: initial announcement email
- 1 minute before the operation window: email letting users know the operation is starting
- 1 minute after the operation ends: optional email letting users know the outcome (not always required)
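As a minimal illustration of the date format and timeline rules above, here is a small Python sketch. The window start used here is a made-up example value, not a real maintenance window:

from datetime import datetime, timedelta, timezone

# Hypothetical maintenance window start (example value only)
window_start = datetime(2019, 10, 7, 14, 0, tzinfo=timezone.utc)

# Initial announcement should go out at least one week ahead of the window
announce_by = window_start - timedelta(weeks=1)

# YYYY-MM-DD dates and UTC times, as recommended above
print("Send initial announcement by:", announce_by.strftime("%Y-%m-%d"))
print("Maintenance window starts at:", window_start.strftime("%Y-%m-%d %H:%M UTC"))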
Examples
Example email sent by Andrew regarding an OpenStack upgrade to the cloud-announce mailing list. The email was sent 7 days in advance:
Announce email example:

Subject: cloud-vps maintenance Monday, 2019-10-07

We'll be upgrading the cloud services OpenStack install on Monday, beginning at 14:00 UTC. The entire upgrade process may take a couple of hours. Early on in the process, Horizon (and associated OpenStack APIs) will be disabled (probably for 20 to 30 minutes.) There may also be brief network interruptions during the upgrade, although if all goes well these will not be noticeable by users. Toolforge and existing VMs should be largely unaffected apart from possible network hiccups.
Example email sent by Arturo regarding an operation to reboot several cloudvirt servers, sent to the cloud-announce mailing list. The email was sent 7 days in advance:
Announce email example:

Hi there,

Next Wednesday 2019-10-09 at 09:00 UTC we will be doing a maintenance operation on some of our cloudvirt servers (the hypervisor servers) that involves rebooting both the physical servers and the virtual machines running on them. The reason is that we need to update the running linux kernel version they have.

In this window we will reboot 4 hypervisors:
* cloudvirt1008
* cloudvirt1009
* cloudvirt1012
* cloudvirt1013

The procedure will be to reboot a server, wait for it to come back online (could take up to 5 minutes) and wait for all the VMs to come back online. Then move to the next server.

Toolforge users may see their tools and webservices briefly disrupted due to several components of the Toolforge infrastructure being rebooted in this operation.

If nothing changes (reallocated or new virtual machine, etc) this is the list of affected VM instances in each hypervisor (project in parentheses):

* cloudvirt1008: tools-sgebastion-09 (tools), tools-k8s-master-01 (tools), deployment-cache-upload05 (deployment-prep), toolsbeta-paws-worker-1002 (toolsbeta), toolsbeta-puppetmaster-02 (toolsbeta), tools-mail-02 (tools), tools-prometheus-02 (tools), tools-elastic-01 (tools), tracker1 (lta-tracker), tools-clushmaster-02 (tools), tools-worker-1020 (tools), tools-k8s-etcd-01 (tools), tools-worker-1010 (tools), tools-worker-1008 (tools), tools-worker-1007 (tools), tools-worker-1003 (tools), tools-sgeexec-0937 (tools)
* cloudvirt1009: toolsbeta-paws-master-01 (toolsbeta), tools-elastic-02 (tools), tools-paws-worker-1005 (tools), tools-prometheus-01 (tools), tools-paws-worker-1002 (tools), puppet-lta (lta-tracker), tools-flannel-etcd-03 (tools), tools-worker-1017 (tools), tools-k8s-etcd-02 (tools), tools-worker-1013 (tools), tools-worker-1012 (tools), tools-worker-1009 (tools), tools-worker-1006 (tools), tools-worker-1004 (tools)
* cloudvirt1012: tools-paws-master-01 (tools), deployment-ms-be06 (deployment-prep), toolsbeta-worker-1001 (toolsbeta), deployment-cumin02 (deployment-prep), toolsbeta-k8s-master-01 (toolsbeta), toolsbeta-k8s-etcd-01 (toolsbeta), toolsbeta-puppetdb-01 (toolsbeta), tools-redis-1002 (tools), tools-paws-worker-1003 (tools), tools-paws-worker-1001 (tools), tools-elastic-03 (tools), tools-worker-1025 (tools), tools-worker-1026 (tools), tools-worker-1022 (tools), tools-worker-1019 (tools), tools-worker-1018 (tools), tools-k8s-etcd-03 (tools), tools-worker-1016 (tools), tools-flannel-etcd-01 (tools), tools-worker-1014 (tools), phlogiston-5 (phlogiston), dumps-3 (dumps), codesearch4 (codesearch), wikispeech-wiki-stretch (wikispeech), ores-worker-01 (ores), puppet-jmm-kernel-stretch2 (puppet), mcr-base (mcr-dev), rel2 (search), mc-clusterA-2 (test-twemproxy), wikibrain-embeddings-02 (wikibrain), qube-node1 (k8splay), cindy (pluggableauth), cvn-apache9 (cvn), zk1-2 (analytics)
* cloudvirt1013: tools-flannel-etcd-02 (tools), paws-ext-lb-01 (paws), abogott-puppetclient (testlabs), tools-worker-1028 (tools), tools-worker-1005 (tools), cloudstore-dev-02 (cloudstore), cloudstore-puppetmaster-01 (cloudstore), deployment-aqs03 (deployment-prep), osmit-test (osmit), tools-sgewebgrid-lighttpd-0927 (tools), tools-sgewebgrid-lighttpd-0926 (tools), tools-sgewebgrid-lighttpd-0925 (tools), tools-sgewebgrid-lighttpd-0924 (tools), tools-sgewebgrid-lighttpd-0923 (tools), tools-sgewebgrid-lighttpd-0922 (tools), tools-sgewebgrid-lighttpd-0920 (tools), tools-sgewebgrid-lighttpd-0917 (tools), tools-sgewebgrid-lighttpd-0909 (tools), tools-sgeexec-0925 (tools), tools-sgeexec-0923 (tools), tools-sgeexec-0910 (tools), cyberbot-db-01 (cyberbot)

regards.
Changes tracking
TODO: Do we have internal rules for introducing patches into the several repos we use for work? ops/puppet, dns, etc.
Tickets and work tracking
This section describes how we use our ticketing system (Phabricator) to track our work and the issues related to our systems and services.
See also: mw:Wikimedia Cloud Services team/Team work board practices
Server Admin Logs
This section describes our practices for recording the operations we do, like changes to servers or services.
Logging an operation to a SAL creates a paper trail of what we do. This improves collaboration in a team which is distributed by nature. It also helps improve the transparency of our operations.
- Use !log admin ... in #wikimedia-cloud to log all operations related to the Cloud VPS service in general. This is the SAL for the Cloud VPS service.
- Use !log tools ... in #wikimedia-cloud to log all operations related to the Toolforge service. This is the SAL for the Toolforge service.
- Use !log ... in #wikimedia-operations to log all operations related to physical hardware. This is the SAL for general operations tasks. (See the example entries below.)
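For illustration, a good SAL entry says what was done and why, and references a task when there is one. The lines below are hypothetical examples (the task number T123456 is made up; the hostnames are taken from the announcement example above), not real log entries:

!log admin rebooting cloudvirt1013 to pick up the new kernel (T123456)
!log tools depooling tools-worker-1020 ahead of the cloudvirt1008 reboot (T123456)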
TODO: add information. Describe how we use the different !log mechanisms in the different channels.
Personal availability
This section includes information regarding how we try to coordinate to ensure we always have enough human availability to support our services.
TODO: information on vacations, unavailability communication, etc. Does this fit into this document?
On-call and paging
Worth mentioning: Wikimedia_Cloud_Services_team/Clinic_duties, which we do while on-call.
TODO: include here all we know about how we coordinate paging/on-call, etc? Does this fit into this document?
Additional notes
Please take these additional notes into consideration.
- This document is intended to be an addition to other standard and industry procedures; it does not replace them, it just adds information and tips.
- Changes to this document require agreement among the affected people (i.e., the WMCS folks).
- We have agreed to and follow our Technical Engagement Team Social Norms.