Incidents/20140211-Parsoid
Summary
Verbose logging combined with broken log rotation filled the disks on roughly three quarters of the Parsoid nodes, which caused the Parsoid daemons to stop accepting requests. This produced user-visible errors for VisualEditor users during a 17-minute window; we estimate that fewer than 10% of VE page loads and saves were affected during this period.
Timeline
All times are UTC on Tuesday, February 11 (03:00 UTC = 7pm PST on Monday evening):
- 03:02 First disk space alerts for wtp* [1]
- 03:06 First connection refused alert
- 03:10 <springle> those look real. root full on wtp1008
- 03:12 Most wtp* servers now refusing connections [2]
- 03:12 Sean removes the log file on wtp1008 and restarts the service
- 03:14 parsoid.svc.eqiad.wmnet LVS check goes CRITICAL, sends pages but not to me
- 03:16 wtp1008 comes back up
- 03:17 Roan gets on IRC, having been dragged out of a conversation by Erik
- 03:20 wtp1021 and the LVS check magically come back up (??)
- 03:20 Roan saves a copy of wtp1005's log for analysis; it was later discovered to be the wrong file
- 03:23 Roan starts a rolling restart of the Parsoid cluster using the command documented on wikitech
- 03:26 The LVS check goes CRITICAL again; wtp10{01,02,04,10} go down
- 03:28 Roan uses the old init script to restart Parsoid instead
- 03:29 Entire Parsoid cluster comes back up
Conclusions
- Log rotation in puppet was not properly tested, and did not run often enough to prevent failures (a size-capped hourly rotation sketch follows this list)
- Parsoid's current logging via stdout/stderr redirection can block when the disk fills up. Work on async logging is ongoing, but was not ready before this outage.
- Disk space monitoring on Parsoid boxes should trigger much earlier
- The logging volume in the Parsoid tests needs closer checking (a recursion bug in the error logging code produced megabytes of log data per error)
- Salt restarts were using the old init script instead of the upstart job; see bug
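
As context for the log rotation point above, here is a minimal sketch of a size-capped logrotate configuration; the log path, size limit, and retention count are illustrative assumptions, not the actual production settings. Because logrotate only acts when it is invoked, a size limit like this is only effective if the config is run frequently (e.g. from an hourly cron job) rather than from the default daily run.

```
# /etc/logrotate.d/parsoid -- illustrative sketch, not the deployed config
/var/log/parsoid/parsoid.log {
    size 100M          # rotate as soon as the file grows past 100 MB
    rotate 5           # keep at most five rotated copies
    compress
    missingok
    notifempty
    copytruncate       # truncate in place so the running daemon keeps its file handle
}
```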
Actionables
- Status: Done - Fix log rotation, run it hourly instead of daily
- Status: Done - Remove old init scripts and update documentation on the log file path
- Status: Done - Lower the warning threshold on parsoid node disk space to provide more time to react (an example check follows this list)
- Status: Done - Finish the migration to an async logging backend in Parsoid so that a full disk does not affect service availability
- Status: Unresolved - Check the logging volume in Parsoid unit tests; this is less critical once logging is async
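
To illustrate the disk space actionable above: with the standard Nagios/Icinga check_disk plugin, warning earlier simply means raising the free-space thresholds on the check. The thresholds and mount point below are examples only, not the values actually deployed.

```
# Warn when less than 25% of / is free, go critical below 15% free
/usr/lib/nagios/plugins/check_disk -w 25% -c 15% -p /
```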