Incidents/20140619-parsercache/IRC log
IRC conversation on #wikimedia-operations during the 2014-06-19 pc1002 outage. All times are UTC.
Trimming and annotations made by Antoine "hashar" Musso.
[19:00:19] <logmsgbot> !log reedy Finished scap: scap 1.24wmf10 take 2... (duration: 22m 59s)
[19:00:24] <morebots> Logged the message, Master
[19:03:05] <grrrit-wm> (PS1) Reedy: Wikipedias to 1.24wmf9 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140769
[19:03:51] <grrrit-wm> (CR) Reedy: [C: 2] Wikipedias to 1.24wmf9 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140769 (owner: Reedy)
[19:04:28] <grrrit-wm> (Merged) jenkins-bot: Wikipedias to 1.24wmf9 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140769 (owner: Reedy)
[19:06:27] <logmsgbot> !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.24wmf9
[19:09:15] <grrrit-wm> (PS1) Reedy: group0 to 1.24wmf10 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140770

# And here pc1002 is reported dying:
[19:09:51] <icinga-wm> PROBLEM - check configured eth on pc1002 is CRITICAL: Timeout while attempting connection
[19:10:01] <icinga-wm> PROBLEM - Disk space on pc1002 is CRITICAL: Timeout while attempting connection
[19:10:01] <icinga-wm> PROBLEM - check if dhclient is running on pc1002 is CRITICAL: Timeout while attempting connection
[19:10:11] <icinga-wm> PROBLEM - MySQL disk space on pc1002 is CRITICAL: Timeout while attempting connection
[19:10:11] <icinga-wm> PROBLEM - mysqld processes on pc1002 is CRITICAL: Timeout while attempting connection
[19:10:11] <icinga-wm> PROBLEM - DPKG on pc1002 is CRITICAL: Timeout while attempting connection
[19:10:11] <icinga-wm> PROBLEM - RAID on pc1002 is CRITICAL: Timeout while attempting connection

# First user reports (icinga still beat them by a few seconds, yeah!)
[19:10:26] <Excirial> Fyi: Just received several 503's on Enwiki.
[19:10:36] <twkozlowski> same on Commons, peeps are reporting
[19:11:13] <jorm> all wikis.
[19:11:31] <Reedy> Sweet
[19:11:41] <icinga-wm> PROBLEM - Host pc1002 is DOWN: PING CRITICAL - Packet loss = 100%
[19:11:48] <Reedy> And that'll be why

# Root cause analysis performed by Reedy in less than 30 seconds.
# Then the Apaches die...
[19:11:51] <icinga-wm> PROBLEM - Apache HTTP on mw1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:11:51] <icinga-wm> PROBLEM - Apache HTTP on mw1071 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:11:51] <icinga-wm> PROBLEM - Apache HTTP on mw1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
....

# Debugging time!
[19:12:27] <RobH> checking pc1002
[19:12:56] <RobH> seems its borked completely and unresponsive to ssh
[19:12:59] <RobH> checking mgmt
[19:13:28] <mutante> oom death
[19:13:32] <mutante> powercycle
[19:13:35] <RobH> im rebooting it

# Indeed, not much we can do since the box is dead.
[19:13:36] <ori> Reedy: i'd revert
[19:13:38] <ori> before it boots up

# If there is one thing we do properly, it is reverting. Proposed less than 2
# minutes after the cause was identified.
[19:14:20] <ori> Reedy: it's going to boot with a cold, empty memcached
[19:14:26] <ori> if it OOMd there's a good chance it's related to the deploy
[19:14:37] <Reedy> isn't it mysql?
[19:14:52] <ori> better start populating the cache using known-good mw code

# And reverting...
[19:15:16] <logmsgbot> !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias back to 1.24wmf8
[19:15:31] <icinga-wm> RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time
[19:15:31] <icinga-wm> RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.073 second response time
[19:15:48] <RobH> !loh powercycled pc1002
[19:16:00] <ori> coming back up

# It came back because of the revert. At that point mysql was not running on pc1002.
[19:16:36] <RobH> it will be a few minutes
[19:22:10] <Reedy> Good to see https://wikitech.wikimedia.org/wiki/Parser_cache is up to date

# Points at bug 1. We have no documentation at all for the parser cache.
[19:22:32] <Reedy> 146 Fatal error: Call to a member function real_escape_string() on a non-object in /usr/local/apache/common-local/php-1.24wmf9/includes/db/DatabaseMysqli.php on line 289
[19:22:41] <icinga-wm> RECOVERY - Host pc1002 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[19:22:41] <mutante> pc1002 login:

# pc1002 finally back up
[19:29:46] <RobH> its back up
[19:29:51] <RobH> well, its responsive to ssh now
[19:29:54] <RobH> (im on it)
[19:30:07] <_joe_> RobH: is mysql up?
[19:31:04] <RobH> nope, wont fire
[19:31:12] <RobH> _joe_: if you know about the server, hop on ;]
[19:31:48] <_joe_> I don't even know wich one it is
[19:31:52] <ori> pc1002
[19:31:56] <_joe_> I can try to troubleshoot mysql not starting
[19:31:58] <mutante> !log started mysql on pc1002
[19:31:59] <springle> wtf
[19:32:01] <icinga-wm> RECOVERY - mysqld processes on pc1002 is OK: PROCS OK: 1 process with command name mysqld
[19:32:03] <springle> what happened
[19:32:04] <morebots> Logged the message, Master

# Summary for springle (DBA)
[19:32:28] <ori> springle: wmf10 to all wikis -> pcs1002 unresponsive -> apache busy workers -> 503s
[19:32:29] <mutante> RobH: /etc/init.d/mysql start
[19:32:31] <RobH> cuz normal upstart didnt do it

# That should be fixed.
[19:32:43] <ori> springle: pcs1002 hard reboot, mysql on pcs1002 did not come up
[19:32:48] <mutante> it first said something about the pid file existing
[19:32:49] <ori> springle: that's where we're at
[19:32:50] <mutante> but it's not set
[19:33:15] <chasemp> ori springle then mutante did something not mysql is up there :)
[19:33:19] <mutante> springle:
[19:33:21] <mutante> root@pc1002:~# /etc/init.d/mysql status * MySQL is not running, but PID file exists
[19:33:27] <mutante> root@pc1002:~# /etc/init.d/mysql status * MySQL is not running
[19:33:33] <bd808> ori, springle: looks like it may have started with many "DBError: DB connection error: Too many connections (10.64.16.157)" events
[19:33:35] <springle> mysql is recovering
[19:33:35] <mutante> root@pc1002:~# /etc/init.d/mysql status * MySQL running (6031)
[19:34:36] <Reedy> springle: I told you we needed to replace it with MongoDB
[19:35:23] <springle> Reedy: :)
[19:35:44] <ori> so, DB connection error: Too many connections
[19:36:18] <ori> on rollout of new release
[19:36:19] <ori> is consistent with a run on the parser cache DBs due to massive cache invalidation
[19:36:31] <Reedy> First time we've seen this...
[19:36:35] <Reedy> Or at least, in a long time anyway
[19:36:37] <_joe_> looking at the logs... mysql server became unresponsive/slow, connection piled up, this blocked workers in apache
[19:36:56] <_joe_> probably this has happened before any failover mechanism kicked in
[19:37:06] <robla> Reedy: yeah....we've had at least one parser cache outage that almost went unnoticed, so this is a weird new development
[19:37:08] <bd808> sql-bagostuff events in logstash -- https://logstash.wikimedia.org/#dashboard/temp/S9cjGaUCRriwaPpk9xNMCw
....