Incident response/Runbook

From Wikitech

This is a brief, at-a-glance description of the steps to take when responding to an ongoing incident.

Don’t panic. Even when the wikis are down, you have time to communicate.

If you’ve been paged

  • Stop everything else you’re doing. If you can, respond even if you’re not at your desk.
  • Speak up in #wikimedia-operations to say you got the page and you’re looking at it. Read up in that channel for context.
  • If the alert is a clear false alarm, you can stop here.
  • If the alert may be caused by a (D)DoS or other attack or security issue, move to #mediawiki_security. If there’s too much alert noise, move to #wikimedia-sre. Otherwise, stay in #wikimedia-operations.
  • Every genuine page needs an Incident Coordinator (IC). If you're an SRE and there's no IC yet, you should become the IC.
  • If 3 or more people are responding, this is an incident and it needs an IC.
  • If you are on call and the other on-call person is available, agree on who will take IC and who will do the troubleshooting.
  • If you are on call and the other on-call person is unavailable, alert others by mentioning #page in the #wikimedia-operations IRC channel, and do the troubleshooting until other engineers are available. At that point, take IC.
  • Acknowledge either the Icinga or Alertmanager alert(s).
  • Acknowledge the incident on VictorOps.
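
The channel-selection rule in the list above can be sketched as a small decision function. This is only an illustration: the function name `pick_channel` and its two boolean parameters are hypothetical and not part of any Wikimedia tooling, and giving the security check priority over the noise check is an assumption based on the order in which the rule is stated.

```python
def pick_channel(possible_attack: bool, too_noisy: bool) -> str:
    """Return the IRC channel to coordinate in (hypothetical helper).

    Assumption: a possible attack or security issue takes priority
    over alert noise, mirroring the order of the rule in the runbook.
    """
    if possible_attack:
        # (D)DoS, other attack, or security issue
        return "#mediawiki_security"
    if too_noisy:
        # Too much alert noise in the default channel
        return "#wikimedia-sre"
    # Default: stay where the page was announced
    return "#wikimedia-operations"
```

For example, `pick_channel(False, True)` returns `"#wikimedia-sre"`.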

If there was no page, but...

  • If the issue affects users, and three or more people are working on it, there should be an IC.
  • If the issue needs continuous attention, so you’ll be handing it off until it’s resolved, there should be an IC.
  • If you’re not sure whether there should be an IC, it’s better to have one. If it turns out to be unnecessary, you can stop later.
  • If you're an SRE and there's no IC yet when one is needed, you should become the IC. Alert others by mentioning #page in IRC, and proceed below.
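
As a rough sketch, the conditions above for deciding whether an IC is needed can be written as a single predicate. The `Situation` class, its field names, and `needs_ic` are hypothetical names introduced for illustration. The code encodes only the rules stated in this runbook: a genuine page always needs an IC, and without a page an IC is needed when users are affected and three or more people are working on the issue, when the issue needs continuous attention across handoffs, or when you are unsure.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """Hypothetical summary of the current state of an issue."""
    genuine_page: bool = False    # a page fired and it is not a false alarm
    affects_users: bool = False   # the issue is user-visible
    responders: int = 1           # people currently working on the issue
    needs_handoffs: bool = False  # needs continuous attention until resolved
    unsure: bool = False          # you cannot tell whether an IC is warranted


def needs_ic(s: Situation) -> bool:
    """Return True if the runbook's rules call for an Incident Coordinator."""
    if s.genuine_page:
        return True
    if s.affects_users and s.responders >= 3:
        return True
    if s.needs_handoffs:
        return True
    # When in doubt, it is better to have an IC; the role can be dropped later.
    return s.unsure
```

For example, `needs_ic(Situation(affects_users=True, responders=3))` returns `True`, while `needs_ic(Situation(affects_users=True, responders=2))` returns `False`.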

To become the Incident Coordinator (IC)

  • If there is an offgoing IC, ensure that you are both in agreement about the handoff.
  • Announce in IRC, “I am the IC.” You are now the IC.
  • If there’s not yet a status doc, start one by making a copy of the template (File -> Make a copy).
  • Update the status doc to say “IC: <your name and IRC nick>” and add the IC handoff to the timeline.
  • If it's not already there, put a link to the status doc in the topic of #mediawiki_security, along with a few words identifying the incident (“foobaroid OOMs”) or at least the date.

When you are the IC

  • Communicate, don’t deep dive. Resist the temptation to troubleshoot the issue; let others do that. Your job is to keep the big picture. If you’re uniquely suited to solve the problem yourself, hand off the IC role to someone else.
  • Keep track of what needs to be done, and what everyone is working on. Assign tasks as needed to make sure everything is covered and no one is doing conflicting work.
  • Set a timer (or delegate this to someone else): every half hour, make sure to:
  • Ask questions. It’s important for you to be fully informed, and it’s also likely that if you don’t know the answer, others don’t either.
    • If you’re not sure what someone is doing, ask them.
    • If someone was investigating a question and you never saw an answer, follow up.
    • If the team agrees “we should do X,” ask who is going to do it -- or assign it to someone.
  • Using the guidelines on officewiki, evaluate whether you need to notify SRE Directors, Legal, Comms, or WMF leadership. If so, either contact Directors yourself or assign someone to do so.
  • Continue to actively work as the IC until you hand off the role to a specific person or until the incident is over.

When you are not the IC

  • Watch IRC while you work. If others are talking to you, make sure you’ll know.
  • Talk in IRC while you work. Don’t take any action without announcing it first. Keep the channel free of unnecessary chatter during the incident.
  • Log your actions to the SAL. It’s better to log too much than too little.
    • In #wikimedia-operations, say !log Restarted foobaroid on xyz1234.
    • If the incident is security-sensitive, instead use !log-private in #mediawiki_security for visibility, even though it doesn’t actually log anywhere.
  • If you need more people to help you, tell the IC.
  • If you have a question no one has asked, or you know something no one is talking about, speak up -- even if you think someone must have thought of it already.
  • If one person has been the IC for several hours, or if it’s near the end of their workday, consider asking them whether they would like a replacement IC.
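
The SAL logging guidance in the list above can be illustrated with a small helper that picks the channel and command prefix for an action. This is a minimal sketch: `sal_command` is a hypothetical name, not an existing bot or library function, and it only builds the text you would type in IRC; it does not send anything.

```python
from typing import Tuple


def sal_command(action: str, security_sensitive: bool = False) -> Tuple[str, str]:
    """Return (channel, message) for announcing an action (hypothetical helper).

    Regular actions use !log in #wikimedia-operations. Security-sensitive
    ones use !log-private in #mediawiki_security, which is for visibility
    only and does not actually log anywhere.
    """
    action = action.strip()
    if not action:
        raise ValueError("describe the action you took")
    if security_sensitive:
        return "#mediawiki_security", f"!log-private {action}"
    return "#wikimedia-operations", f"!log {action}"
```

For example, `sal_command("Restarted foobaroid on xyz1234")` returns `("#wikimedia-operations", "!log Restarted foobaroid on xyz1234")`.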

To hand over the IC role to another person

  • If the incident is in progress, you are the IC until someone takes over from you.
  • Make sure the status doc is up-to-date with everything you know.
  • Make sure the new IC has a full understanding of the situation so far: what’s known, what’s unknown, and who’s working on what.
  • Make sure they know they are the IC.
  • Make sure they update IRC and the status doc to show they’re the IC.
  • You are no longer the IC. Good job!

To resolve the incident and stop being IC

  • Even if there’s still work to do, you may not need an IC if that work is no longer urgent. When remaining tasks can wait until normal working hours, the IC can end the incident.
  • Update the status doc with everything you know. Remind others to do the same. This is much easier now than it will be later. Update the incident status to “resolved.”
  • Make sure unfinished work is tracked in Phabricator, and tasks are linked from the doc.
  • Announce in IRC, “I am resolving the incident.” Mention the status of any continuing issues. Make sure to update each channel where the incident was discussed.
  • You are no longer the IC. Good job!
  • Make sure there are no pending incident reports about outages that happened during your shift.

Writing an incident report

As an IC, you are responsible for making sure that all incidents that happened during your shift are correctly filed and scored:

  • If the topic is of a sensitive nature (PII or security-related), keep the incident status in Google Docs.
  • As someone with first-hand experience of the incident, you should be in a good position to write the initial version of the report: you will have a good overview of the incident and how it evolved over time, even if you are not an expert on the service.
  • For areas you don't have the expertise on, the suggested course of action is finding a subject matter expert to help fill in the details. This can mean co-writing or delegating, depending on the situation.
    • If a subject matter expert is not available, or it's not obvious who they might be, contact a manager of the team most likely to own the service.
  • Unless further research is needed, writing the report early, while details are still fresh in your mind, is highly encouraged. It can be refined later, during the review stage.
  • If for some reason you cannot file the report (e.g. you are going on vacation), make sure to find someone to do it for you (e.g. the other person on call with you).
  • Add a bullet to the notes for the next SRE meeting, for awareness.

Deciding whether to contact others & how to contact others

Information on when and how to involve other teams in WMF is on officewiki, since it includes staff members' contact information.