Incidents/2018-04-10 Routing
Summary
A configuration change on routers in the Ashburn and Singapore datacenters caused a service interruption of ~10 min (22:53-23:03 UTC) for users redirected to Ashburn, and ~40 min (22:47-23:24 UTC) for users redirected to Singapore.
More details: task T191940
Timeline
- 22:47 Change pushed to cr1-eqsin
- 22:53 Change pushed to cr2-eqiad
- 22:58 cr2-eqiad rolled back
- 23:03 eqiad full recovery (after routing convergence)
- 23:22 cr1-eqsin rolled back (partial recovery)
- 23:31 eqsin depooled
- 23:36 eqsin full recovery
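As a quick cross-check of the intervals between the timeline entries above, the arithmetic can be sketched as follows. The dictionary keys and function names are illustrative assumptions, not part of any existing tooling; only the timestamps come from the timeline.

```python
from datetime import datetime

# Timestamps (UTC, same day) copied from the timeline above.
# The key names are hypothetical labels chosen for this sketch.
TIMELINE = {
    "eqiad": {"pushed": "22:53", "rolled_back": "22:58", "full_recovery": "23:03"},
    "eqsin": {"pushed": "22:47", "rolled_back": "23:22", "full_recovery": "23:36"},
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed(site: str) -> dict:
    """Minutes from the change push to rollback and to full recovery."""
    t = TIMELINE[site]
    return {
        "push_to_rollback": minutes_between(t["pushed"], t["rolled_back"]),
        "push_to_full_recovery": minutes_between(t["pushed"], t["full_recovery"]),
    }


if __name__ == "__main__":
    for site in TIMELINE:
        print(site, elapsed(site))
```

For eqiad this gives 5 minutes from push to rollback and 10 minutes to full recovery; for eqsin, 35 and 49 minutes respectively.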
Conclusions
- Changes, even if already live in part of the infrastructure, need to be discussed more thoroughly with the team
- POPs (especially non-redundant ones) should be depooled before applying changes if there is any doubt
- The same change had different results across the deployment:
  - No issues, working as expected (e.g. switches, cr2-esams)
  - Partial failure (cr1-eqsin): connectivity to the router and rpd appeared healthy, but user traffic was being dropped
  - Full failure (cr2-eqiad): connectivity to the router was lost instantly
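The three outcomes above suggest that a post-change check should look at user traffic as well as router reachability and rpd state, since cr1-eqsin looked healthy on the latter two while dropping traffic. A minimal sketch of such a classification follows; `RouterCheck`, `classify`, the `traffic_ratio` signal and the 0.8 threshold are all hypothetical and do not describe any existing monitoring.

```python
from dataclasses import dataclass


@dataclass
class RouterCheck:
    """Hypothetical post-change observations for one router."""
    reachable: bool       # management connectivity to the router
    rpd_healthy: bool     # routing daemon appears healthy
    traffic_ratio: float  # user traffic after the change / baseline before it


def classify(check: RouterCheck, min_ratio: float = 0.8) -> str:
    """Map observations onto the three outcomes seen in this incident.

    min_ratio is an assumed threshold, not a measured value.
    """
    if not check.reachable:
        return "full failure"
    if not check.rpd_healthy or check.traffic_ratio < min_ratio:
        return "partial failure"
    return "ok"


if __name__ == "__main__":
    # Router answers and rpd looks fine, but almost no user traffic flows.
    print(classify(RouterCheck(reachable=True, rpd_healthy=True, traffic_ratio=0.05)))
    # Router unreachable right after the change.
    print(classify(RouterCheck(reachable=False, rpd_healthy=False, traffic_ratio=0.0)))
```

A check that only considered `reachable` and `rpd_healthy` would have reported the first case as healthy, which is why the sketch treats the traffic signal as a separate input.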
Actionables
- Tickets have been opened with the vendor: phab:T191667 (update: crash reason found)