Deployments/Emergencies
If you're looking for help with an emergency situation, please first try to contact Release Engineering and SRE in #wikimedia-operations on libera.chat. If that fails, it may be appropriate to use Klaxon.
Emergency deployments happen when something needs fixing immediately, at a time when deployments would not normally take place.
How to
🚨 Step-by-step – to do an emergency release you must:
- Join #wikimedia-operations on libera.chat
- Get positive confirmation from SRE before deployment, and inform Release Engineering that you need to deploy (see the template below)
- Have someone able to deploy your change
Ways to find a deployer:
- Ask in #wikimedia-operations on libera.chat whether someone is available to deploy (see the template below)
- Message Tyler (thcipriani) and/or the person assigned to this week's train
- Whichever route you use, include a link to the patch you'd like to deploy (and the task, if appropriate)
IRC message template
I need an emergency deploy for https://gerrit.wikimedia.org/r/1234 -- context is T1234, are SRE ok with a deployment? (cc: thcipriani [INSERT WEEKLY TRAIN CONDUCTOR NAME]). I (already have|need) someone to deploy.
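As a convenience, the template can be filled in programmatically. The following is a minimal sketch, not an official tool: the function name `emergency_deploy_message` and its parameters are hypothetical, and the wording is copied from the template above.

```python
def emergency_deploy_message(patch_url, task_id, train_conductor, has_deployer):
    """Fill in the IRC template for an emergency deploy request.

    All names here are illustrative assumptions; only the message
    wording comes from the template on this page.
    """
    # Choose one of the two alternatives from "(already have|need)".
    deployer_status = "already have" if has_deployer else "need"
    return (
        f"I need an emergency deploy for {patch_url} -- context is {task_id}, "
        f"are SRE ok with a deployment? (cc: thcipriani {train_conductor}). "
        f"I {deployer_status} someone to deploy."
    )


if __name__ == "__main__":
    # "conductor_nick" is a placeholder for this week's train conductor.
    print(
        emergency_deploy_message(
            "https://gerrit.wikimedia.org/r/1234",
            "T1234",
            "conductor_nick",
            has_deployer=False,
        )
    )
```

Replace the placeholder arguments with the real patch URL, task ID, and the IRC nick of this week's train conductor before pasting the result into the channel.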
Reasons for an emergency deploy
- Address security issues
- For example, a misconfiguration once meant that a private wiki and all of its content were accidentally made public.
- Avoid data loss / corruption
- For example, a coding error meant that newly rendered pages were being cached in a corrupted form; the longer it went on, the more of the site was wrong.
- Maintain availability
- For example, a new feature proved much more popular than planned and the extra load it was causing was threatening to take down the site, so it was temporarily disabled over a holiday, until people were back at work.
- Prevent abuse
- For example, a massive content scraping run from a search engine wasn't responding to automated HTTP 429 speed bumps and so had to be manually blocked until they could adjust their code.
- Major loss of functionality / appearance
- For example, a code efficiency change broke the visual appearance and usability of parts of the sites for a large number of logged-out users, and so the change was reverted out of production until it could be fixed.
For deployers
- Rollback first, fix later; maintaining an overall service to our users is the most important focus.
- Prioritise general availability over that of new features; we have a billion readers and only a few users of your new tool, no matter how cool.
- Make on-wiki edits rarely, and only when you really have to; each wiki's editing community expects autonomy.