Incidents/2017-04-26 ORES

Summary

Today, an ORES deployment caused a large number of timeout errors in CODFW, but not in EQIAD. It looks like uwsgi was running old code while celery was running new code. Restarting the services did not fix the problem, so we switched traffic to EQIAD instead. The problem was then resolved.

Timeline

2034 UTC: ORES deployment is completed
2041 UTC: Icinga warns of an outage
2048 UTC: Halfak confirms that a bunch of requests are timing out in CODFW but EQIAD is doing OK.
2102 UTC: Phab:T163944 is created to track the issue.
2114 UTC: A service restart is issued via scap and directly (by mutante) via systemctl: "100.0% (6/6) success ratio (>= 100.0% threshold) for command: 'systemctl restart uwsgi-ores'"
2135 UTC: mutante posts a patchset to re-route traffic to eqiad. (https://gerrit.wikimedia.org/r/#/c/350487/)
2143 UTC: the patch is merged and puppet is run
2145 UTC: everything is OK again

Conclusions

At the time of the incident, we had no idea what could have caused this. We filed a task to investigate: T163950 -- Investigate failed deploy to CODFW

Update @ 2017-05-02: The problem was caused by scb2005 and scb2006. We had no idea they existed. They weren't in our scap config so new code wasn't getting deployed to them. The problem was solved once we added them to the scap config and deployed to them.
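The gap described in the update (hosts serving traffic that are missing from the deploy target list) could be caught with a simple set comparison. The following is only a minimal sketch, not what was actually done: the function name and both host lists are hypothetical, and only scb2005 and scb2006 are taken from this report. It assumes you can obtain the list of hosts in the cluster and the list of deploy targets from wherever they are maintained.

```python
def find_undeployed_hosts(cluster_hosts, deploy_targets):
    """Return hosts that are in the cluster but absent from the deploy targets."""
    return sorted(set(cluster_hosts) - set(deploy_targets))


# Hypothetical inventory for illustration; only scb2005 and scb2006
# are hostnames mentioned in this report.
cluster_hosts = ["scb2001", "scb2002", "scb2005", "scb2006"]
deploy_targets = ["scb2001", "scb2002"]

missing = find_undeployed_hosts(cluster_hosts, deploy_targets)
print(missing)  # ['scb2005', 'scb2006']
```

Running a check like this before each deployment would flag any host that would keep running old code after the deploy.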

Actionables

T163950 -- Investigate failed deploy to CODFW