Incidents/2017-04-26 ORES
Summary
Today, an ORES deployment caused a large number of timeout errors in CODFW, but not in EQIAD. It looked like uwsgi was running old code while celery was running new code. Restarting the services did not fix the problem, so we switched traffic to EQIAD, which resolved it.
Timeline
- 2034 UTC
- ORES deployment is completed
- 2041 UTC
- Icinga warns of an outage
- 2048 UTC
- Halfak confirms that many requests are timing out in CODFW, while EQIAD is doing OK.
- 2102 UTC
- Phab:T163944 is created to track the issue.
- 2114 UTC
- A service restart is issued via scap and directly (by mutante) via systemctl: "100.0% (6/6) success ratio (>= 100.0% threshold) for command: 'systemctl restart uwsgi-ores'"
- 2135 UTC
- mutante posts a patchset to re-route traffic to eqiad. (https://gerrit.wikimedia.org/r/#/c/350487/)
- 2143 UTC
- The patch is merged and puppet is run
- 2145 UTC
- Everything is OK again
Conclusions
The cause was unknown at the time of the incident. We filed a task to investigate: T163950 (Investigate failed deploy to CODFW).
- Update @ 2017-05-02
- The problem was caused by scb2005 and scb2006, two hosts we did not know existed. They were not in our scap config, so new code was not being deployed to them. Adding them to the scap config and deploying to them solved the problem.
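The root cause above comes down to a mismatch between the hosts serving traffic and the hosts listed as deploy targets. A check along these lines might catch that drift before a deploy. This is only a minimal sketch: the function name `find_undeployed_hosts` and the two input lists are hypothetical, and how the lists would actually be obtained (for example from the scap config and from whatever inventory knows which hosts serve ORES) is an assumption not covered here. Only scb2005 and scb2006 are host names taken from this report.

```python
def find_undeployed_hosts(serving_hosts, deploy_targets):
    """Return hosts that serve traffic but are missing from the deploy targets.

    Both arguments are iterables of host names. Names are compared
    case-insensitively, ignoring surrounding whitespace and blank entries.
    """
    def normalize(host):
        return host.strip().lower()

    targets = {normalize(h) for h in deploy_targets if h.strip()}
    serving = {normalize(h) for h in serving_hosts if h.strip()}
    # Any host in the serving set but not in the target set would keep
    # running old code after a deploy.
    return sorted(serving - targets)


# Hypothetical inventory: scb2001 and scb2002 are made-up examples;
# scb2005 and scb2006 are the hosts named in the update above.
serving_hosts = ["scb2001", "scb2002", "scb2005", "scb2006"]
deploy_targets = ["scb2001", "scb2002"]

missing = find_undeployed_hosts(serving_hosts, deploy_targets)
if missing:
    print("Hosts serving traffic but not in deploy targets:", ", ".join(missing))
```

Run before each deploy, a non-empty result would signal that some hosts will not receive the new code.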