Release Engineering/SAL/Archive 2

From Wikitech

2016-12-27

2016-12-26

  • 12:09 hashar: beta: restarted varnish.service and varnish-frontend.service on deployment-cache-text04

2016-12-24

2016-12-23

2016-12-22

  • 22:11 thcipriani: disable production l10nupdate for deployment freeze

2016-12-21

  • 05:57 Krinkle: Jenkins "Collapsing Console Sections" for PHPUnit was broken since "-d zend.enable_gc=0" was added to phpunit.php invocation. Updated pattern in Jenkins system configuration.

2016-12-19

2016-12-16

  • 22:34 legoktm: deploying https://gerrit.wikimedia.org/r/327202
  • 14:33 hashar: Nodepool Image ci-jessie-wikimedia-1481897950 in wmflabs-eqiad is ready
  • 14:25 hashar: Nodepool Image ci-trusty-wikimedia-1481897961 in wmflabs-eqiad is ready
  • 14:19 hashar: Refreshing Nodepool images. The snapshots were broken due to mariadb-client failing to upgrade
  • 13:45 hashar: integration / contintcloud : remove security rules of labs projects that allowed gallium (phased out) T95757
  • 13:44 hashar: integration / contintcloud : update security rules of labs projects to allow contint2001
  • 13:15 hashar: integration: update sudo policy for debian-glue to keep the env variable SHELL_ON_FAILURE (for https://gerrit.wikimedia.org/r/#/c/327720/ )
  • 10:15 hashar: integration: apt-get upgrade on all permanent slaves
  • 10:13 hashar: integration-slave-docker-1000 changed docker::version from the no longer existent '1.12.3-0~jessie' to simply 'present'. Will have to manually upgrade it from now on. T153419
  • 10:04 hashar: deployment-puppetmaster02 updated puppet repo. Was stalled due to a bump of the mariadb submodule

2016-12-15

  • 21:00 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/324368
  • 19:23 marxarelli: Manually rebasing and re-applying cherry picks for operations/puppet on integration-puppetmaster01.eqiad.wmflabs
  • 16:08 hashar: deployment-phab02 : apt-get upgrade T147818
  • 14:48 Amir1: ladsgroup@deployment-tin:~$ mwscript updateCollation.php --wiki=fawiki (T139110)
  • 11:41 zeljkof: Reloading Zuul to deploy 327473

2016-12-14

  • 12:38 elukey: created deployment-copper on deployment-prep as temporary test

2016-12-13

2016-12-09

2016-12-08

2016-12-07

  • 15:04 hashar: Image ci-trusty-wikimedia-1481122712 in wmflabs-eqiad is ready T117418
  • 02:29 matt_flaschen: foreachwikiindblist FlowFixInconsistentBoards complete
  • 02:27 matt_flaschen: Started (foreachwikiindblist flow.dblist extensions/Flow/maintenance/FlowFixInconsistentBoards.php) 2>&1 | tee FlowFixInconsistentBoards_2016-12-06.txt on deployment-tin

2016-12-06

  • 21:20 hashar: Image ci-jessie-wikimedia-1481058839 in wmflabs-eqiad is ready T113342
  • 21:13 hashar: Refresh Nodepool Jessie snapshot, which boots 3 times faster. Will help get nodes available faster T113342
  • 16:33 hashar: Nodepool imported a new Jessie image 'jessie-T113342' with some network configuration hotfix. Will use for debugging. T113342
  • 09:08 Reedy: running foreachwiki update.php on beta

2016-12-05

  • 20:43 hashar: Image ci-jessie-wikimedia-1480969940 in wmflabs-eqiad is ready (includes trendingedits::packages, which explicitly defines the installation of librdkafka-dev)
  • 09:52 elukey: add https://gerrit.wikimedia.org/r/#/c/324642/ to the deployment-prep's puppet master to test nutcracker
  • 09:39 hashar: beta-update-databases-eqiad fails due to CONTENT_MODEL_FLOW_BOARD not registered on the wiki. T152379
  • 08:44 hashar: Image ci-jessie-wikimedia-1480926961 in wmflabs-eqiad is ready T113342
  • 08:35 hashar: Pushing new Jessie image to Nodepool that supposedly boots 3x faster T113342

2016-12-04

  • 15:25 Krenair: Found a git-sync-upstream cron on deployment-mx for some reason... commented for now, but wtf was this doing on a MX server?

2016-12-03

2016-12-02

  • 14:40 hashar: added Tobias Gritschacher to Gerrit "integration" group so he can +2 patches on integration/* repositories \O/

2016-12-01

2016-11-30

  • 17:22 gehel: restart of logstash on deployment-logstash2 - upgrade to Java 8 - T151325
  • 17:11 gehel: rolling restart of deployment-elastic0* - upgrade to Java 8 - T151325
  • 11:22 hashar: Gerrit hide mediawiki/extensions/JsonData/JsonSchema Empty since 2013
  • 11:20 hashar: Gerrit made mediawiki/extensions/GuidedTour/guiders read-only (per README.md, no longer used)
  • 11:18 hashar: Gerrit mediawiki/extensions/CentralNotice/BannerProxy.git Empty since 2014

2016-11-29

  • 15:23 hashar: Image ci-jessie-wikimedia-1480432368 in wmflabs-eqiad is ready
  • 14:30 hashar: Image ci-trusty-wikimedia-1480429423 in wmflabs-eqiad is ready T151879
  • 14:24 hashar: Refreshing Nodepool Trusty snapshot to get php5-xsl installed T151879

2016-11-28

2016-11-26

  • 16:15 Reedy: killed /srv/jenkins-workspace/workspace/mediawiki-core-*/src and /srv/jenkins-workspace/workspace/mwext-*/src from integration slaves to get rid of borked MW dirs
  • 15:51 Reedy: deleted /srv/jenkins-workspace/workspace/mediawiki-core-code-coverage/src on integration-slave-trusty-1006 to force a reclone
  • 14:14 Reedy: moved old /srv/mediawiki-staging/php-master to /tmp/php-master, recloned MW Core, copied in LocalSettings, skins, vendor and extensions. T151676. scap sync-dir running
  • 13:05 Reedy: marked deployment-tin as offline due to T151670

2016-11-24

2016-11-23

  • 15:04 Krenair: fixed puppet on deployment-cache-text04 by manually enabling experimental apt repo, see T150660
  • 10:57 hashar: Terminating deployment-apertium01 again T147210

2016-11-22

  • 19:31 hashar: beta: rebased puppet master
  • 19:30 hashar: beta: dropping cherry pick for the PDF render by mobrovac ( https://gerrit.wikimedia.org/r/#/c/305256/ ). Got merged
  • 08:29 hashar: Deleting shut off instances: integration-puppetmaster , deployment-puppetmaster , deployment-pdf02 , deployment-conftool - T150339

2016-11-21

2016-11-19

2016-11-18

2016-11-17

  • 22:07 mutante: re-enabled puppet on contint1001 after live Apache fix
  • 11:34 hasharLunch: Deleted instance deployment-apertium01 . Was Trusty and lacked packages, replaced by a Jessie one ages ago. T147210

2016-11-16

  • 20:53 elukey: restored apache2 config on deployment-mediawiki06
  • 20:28 elukey: temporary increasing verbosity of mod_rewrite on deployment-mediawiki06 as test
  • 20:02 Krenair: mysql master back up, root identity is now unix socket based rather than password
  • 19:57 Krenair: taking mysql master down to fix perms
  • 13:02 hashar: Restarted HHVM on deployment-mediawiki05; it was not honoring requests T150849
  • 12:24 hashar: beta: created dewiktionary table on the Database slave. Restarted replication with START SLAVE; T150834 T150764
  • 10:39 hashar: Removing revert b47ce21 from deployment-tin and reenabling jenkins job. https://gerrit.wikimedia.org/r/321857 will get it fixed
  • 10:26 hashar: Reverting mediawiki/core b47ce21 on beta cluster T150833
  • 09:51 hashar: marking deployment-tin offline so I can live hack mediawiki code / scap for T150833 and T15034
  • 09:12 hashar: deployment-mediawiki04 stopping hhvm
  • 08:59 hashar: beta database update broken with: MediaWiki 1.29.0-alpha Updater\n\nYour composer.lock file is up to date with current dependencies!
  • 07:52 Krenair: the new mysql root password for -db04 is at /tmp/newmysqlpass as well as in a new file in the puppetmaster's labs/private.git
  • 06:34 twentyafterfour: restarting hhvm on deployment-mediawiki04
  • 06:33 Amir1: ladsgroup@deployment-mediawiki05:~$ sudo service hhvm restart
  • 06:30 mutante: restarting hhvm on deployment-mediawiki06

2016-11-15

  • 16:03 hasharAway: adding thcipriani to the labs "git" project maintained by paladox

2016-11-14

  • 08:16 Amir1: cherry-picking 321096/3 in beta puppetmaster

2016-11-12

  • 14:02 Amir1: cherry-picked gerrit change 321096/2 in puppetmaster

2016-11-11

2016-11-10

  • 09:33 hashar: Image ci-jessie-wikimedia-1478770026 in wmflabs-eqiad is ready
  • 09:26 hashar: Regenerate Nodepool base image for Jessie and refreshing snapshot image

2016-11-09

  • 20:27 Krenair: removed default SSH access from production host 208.80.154.135, the old gallium IP
  • 16:34 Reedy: deployment-tin no longer offline, jenkins running jobs now
  • 16:11 Reedy: marking deployment-tin.eqiad as offline to test -labs -> beta config rename

2016-11-08

  • 10:23 hashar: refreshing all jenkins jobs to clear out potential live hack I made but can't remember on which jobs I did

2016-11-07

  • 14:01 gilles: Pointing deployment-imagescaler01.eqiad.wmflabs' puppet to puppetmaster.thumbor.eqiad.wmflabs

2016-11-04

  • 13:20 hashar: gerrit: created mediawiki/extensions/PageViewInfo.git and renamed user group extension-WikimediaPageViewInfo to extension-PageViewInfo T148775
  • 12:57 hashar: Image ci-jessie-wikimedia-1478263647 in wmflabs-eqiad is ready (bring in java for maven projects)
  • 12:49 dcausse: deployment-prep reloading nginx on deployment-elastic0[5-7] to fix ssl cert issue
  • 09:28 hashar: Delete integration-slave-jessie-1003 , only have a few jobs running on permanent Jessie slaves - T148183
  • 09:26 hashar: Delete zuul-dev-jessie.integration.eqiad.wmflabs was for testing Zuul on Jessie and it works just fine on contint1001 :] T148183
  • 09:25 hashar: Delete integration-slave-trusty-1012 one less permanent slave since some load has been moved to Nodepool T148183
  • 09:24 hashar: Delete integration-slave-trusty-1016 not pooled in Jenkins anymore T148183

2016-11-03

  • 15:05 Amir1: deploy 0caa589 in ores to deployment-sca03
  • 14:52 Amir1: deploying ores 0caa589 in deployment-sca03
  • 11:32 hashar: deployment-apertium01 manually cleared puppet.conf
  • 11:29 hashar: deployment-apertium01 fails puppet due to wrong certificate bah
  • 07:22 Krenair: fiddled with jenkins jobs in mediawiki-core-doxygen-publish to try to get stuff moving in the postmerge queue again
  • 05:04 Krenair: beginning to move the rest of beta to the new puppetmaster
  • 01:53 mutante: followed instructions at https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Gearman_deadlock
  • 01:53 mutante: disabling and re-enabling gearman, zuul is not working and could be gearman deadlock

2016-11-02

  • 22:06 hashar: hello stashbot
  • 18:51 Krenair: armed keyholder on -tin and -mira
  • 18:50 Krenair: started mysql on -db boxes to bring beta back online
  • 10:54 hashar: Image ci-jessie-wikimedia-1478083637 in wmflabs-eqiad is ready
  • 10:47 hashar: Force refresh Nodepool snapshot for Jessie so it gets doxygen included T119140

2016-11-01

  • 22:22 Krenair: started mysql on -db03 to hopefully pull us out of read-only mode
  • 22:21 Krenair: started mysql on -db04
  • 22:19 Krenair: stopped and started udp2log-mw on -fluorine02
  • 22:10 hashar: Armed keyholder on deployment-tin . Instance had 20 minutes uptime and apparently keyholder does not self arm
  • 22:00 Krenair: started moving nodes back to the new puppetmaster
  • 02:55 Krenair: Managed to mess up the deployment-puppetmaster02 cert, had to move those nodes back

2016-10-31

  • 20:57 Krenair: moving some nodes to deployment-puppetmaster02
  • 16:57 bd808: Added Niharika29 as project member

2016-10-27

  • 20:51 hashar: reboot integration-puppetmaster01
  • 18:50 bd808: stashbot has replaced qa-morebots in this channel as the sole bot handling !log messages
  • 18:46 bd808: Testing dual page wiki logging by stashbot. (check #3)
  • 18:36 bd808: !log deployment-prep Testing dual page wiki logging by stashbot. (second attempt)
  • 18:14 bd808: !log deployment-prep Testing dual page wiki logging by stashbot.
  • 10:30 hashar: integration: on Trusty slaves, remove jenkins-deploy from KVM which is only needed for Android testing for T149294: salt -v '*slave-trusty*' cmd.run 'deluser jenkins-deploy kvm'
  • 10:29 hashar: integration: on Trusty slaves, remove jenkins-deploy from KVM which is only needed for Android testing: salt -v '*slave-trusty*' cmd.run 'groupdeluser jenkins-deploy kvm'
  • 10:25 hashar: integration: purge Android packages from Trusty slaves for T149294 : salt -v '*slave-trusty*' cmd.run 'apt-get --yes remove --purge gcc-multilib lib32z1 lib32stdc++6 qemu'

2016-10-25

2016-10-24

  • 16:19 andrewbogott: upgrading deployment-puppetmaster to puppet 3.8.5 packages
  • 09:14 hashar: rebasing integration puppet master

2016-10-21

  • 09:42 gehel: decommission of deployment-elastic08 - T147777

2016-10-20

2016-10-14

  • 21:13 matt_flaschen: Ran START SLAVE to restart replication after columns created directly on replica were deleted.
  • 20:53 bd808: Dropped lu_local_id, lu_global_id from replica db which were added improperly
  • 20:37 matt_flaschen: Applied CentralAuth's patch-lu_local_id.sql migration for T148111, to sql --write
  • 20:09 bd808: Applied CentralAuth's patch-lu_local_id.sql migration for T148111
  • 11:30 dcausse: deployment-prep running sudo update-ca-certificates --fresh on deployment-ton to fix curl error code 60 in cirrus maint script (T145609)

2016-10-13

  • 21:21 hashar: Deleted CI slaves integration-slave-jessie-1004 integration-slave-jessie-1005 integration-slave-trusty-1013 integration-slave-trusty-1014 integration-slave-trusty-1017 integration-slave-trusty-1018
  • 20:12 hashar: Switching composer-hhvm / composer-php55 to Nodepool https://gerrit.wikimedia.org/r/#/c/306727/ T143938
  • 16:23 gilles: Resetting to 61a9cd1f47c5aec8ded92f2486ce43309b9e3e03 on deployment-puppetmaster
  • 16:06 godog: add settings to duplicate traffic to thumbor in beta and restart swift-proxy
  • 16:03 gilles: Cherry-picking https://gerrit.wikimedia.org/r/#/c/315648/ on deployment-puppetmaster
  • 15:35 gilles: Resetting to 61a9cd1f47c5aec8ded92f2486ce43309b9e3e03 on deployment-puppetmaster
  • 14:38 gilles: Cherry-picking https://gerrit.wikimedia.org/r/#/c/315234/5 on deployment-puppetmaster
  • 14:34 gilles: Resetting to 61a9cd1f47c5aec8ded92f2486ce43309b9e3e03 on deployment-puppetmaster
  • 14:32 gilles: Cherry-picking https://gerrit.wikimedia.org/r/#/c/315234/4 on deployment-puppetmaster
  • 14:32 gilles: Resetting to 61a9cd1f47c5aec8ded92f2486ce43309b9e3e03 on deployment-puppetmaster
  • 14:27 gilles: Cherry-picking https://gerrit.wikimedia.org/r/#/c/315234/ on deployment-puppetmaster
  • 14:22 gilles: Resetting to 61a9cd1f47c5aec8ded92f2486ce43309b9e3e03 on deployment-puppetmaster
  • 13:42 gilles: Cherry picking https://gerrit.wikimedia.org/r/#/c/315248/ on deployment-puppetmaster

2016-10-12

2016-10-11

  • 21:35 hasharAway: Force pushed Zuul patchqueue 5628f95...fc6a118 HEAD -> patch-queue/debian/precise-wikimedia
  • 14:37 hashar: Mysql was down on Precise slaves. Apparently rebooted 17 days ago and I guess mysql does not spawn on boot. Restarted mysql on all Precise via: salt -v '*slave-precise*' cmd.run 'start mysql'
  • 09:35 godog: reboot deployment-imagescaler01 to enable memory cgroup
  • 08:29 hashar: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/#/c/313387/ Filter out refs/meta/config from all pipelines T52389

2016-10-10

  • 15:45 dcausse: deployment-prep deployment-elastic0[5-8]: reduce the number of replicas to 1 max for all indices

2016-10-07

  • 20:10 hashar: Created repository.integration.eqiad.wmflabs to play/Test Sonatype Nexus
  • 20:10 hashar: rebooting integration-puppetmaster01
  • 07:55 hashar: Upgrading Nodepool image for Jessie

2016-10-06

  • 14:45 hashar: deployment-mira disarmed/rearmed keyholder in an attempt to clear a Shinken alarm
  • 12:16 hashar: Jenkins slave deployment-tin.eqiad , removing label "deployment-tin.eqiad" it has "BetaClusterBastion" and all jobs are bound to it already

2016-10-05

  • 19:33 andrewbogott: removing mediawiki::conftool from deployment-mediawiki04, deployment-mediawiki06, deployment-mediawiki05

2016-10-04

  • 19:43 andrewbogott: removed contint::slave_scripts and associated files from deployment-sca01 and deployment-sca02
  • 16:22 bd808: Restarted puppetmaster process on deployment-puppetmaster
  • 16:20 bd808: deployment-puppetmaster: removing cherry-pick of https://gerrit.wikimedia.org/r/#/c/305256/; conflicts with upstream changes
  • 15:01 godog: shutdown deployment-poolcounter02, replaced by deployment-poolcounter04 - T123734
  • 09:03 hashar: Regenerating configuration of all Jenkins job due to https://gerrit.wikimedia.org/r/#/c/313306/
  • 01:14 twentyafterfour: New scap command line autocompletions are now installed on deployment-tin and deployment-mira refs T142880

2016-10-03

  • 22:40 thcipriani: manual rebase on deployment-puppetmaster:/var/lib/git/operations/puppet
  • 22:05 thcipriani: reapplied beta::deployaccess to mediawiki servers
  • 21:42 cscott: updated OCG to version 0bf27e3452dfdc770317f15793e93e6e89c7865a
  • 21:36 cscott: starting OCG deploy
  • 13:43 hashar: Added integration-slave-trusty-1014 back in the pool
  • 13:41 hashar: Tip of the day: to reboot an instance and bypass molly-guard: /sbin/reboot
  • 13:39 hashar: integration-slave-trusty-1014 upgrading packages, clean up and rebooting it
  • 13:37 hashar: marked integration-slave-trusty-1014 offline. Can't run jobs / gets stuck somehow
  • 10:21 godog: add role::prometheus::node_exporter to classes in hiera:deployment-prep T144502

2016-10-01

  • 09:41 hashar: beta: shutdown deployment-db1 and deployment-db2 . Databases have been migrated to other hosts T138778

2016-09-29

2016-09-28

  • 23:56 MaxSem: Deleted varnish cache files on deployment-cache-upload04 to free up space, disk full
  • 21:48 hasharAway: deployment-tin: service nscd restart
  • 21:43 hasharAway: beta cluster update database is broken :/ Filed T146947 about it
  • 21:25 hasharAway: deployment-tin: sudo -H -u www-data php5 /srv/mediawiki-staging/multiversion/MWScript.php update.php --wiki=commonswiki --quick
  • 21:18 hasharAway: https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/ is broken for unknown reason :(
  • 20:48 hasharAway: Deleted deployment-tin02 via Horizon. Replaced by deployment-tin
  • 20:19 hasharAway: restarted keyholder on deployment-tin
  • 20:11 hasharAway: Switch Jenkins slave deployment-mira.eqiad to deployment-tin.eqiad
  • 20:09 hasharAway: deployment-tin: keyholder arm
  • 20:08 hasharAway: deployment-tin for instance in `grep deployment /etc/dsh/group/mediawiki-installation`; do ssh-keyscan `dig +short $instance` >> /etc/ssh/ssh_known_hosts; done;
  • 19:49 hasharAway: Dropping deployment-tin02 , replacing it with deployment-tin which has been rebuilt as Jessie T144006
  • 12:44 hashar: Can't finish up the switch to deployment-tin, puppet still does not pass due to weird clone issues ...
  • 11:48 hashar: Deleting deployment-tin Trusty instance and recreate one with same hostname as Jessie; Meant to replace deployment-tin02 T144006
  • 10:44 hashar: CI updating all mwext-Wikibase* jenkins jobs for https://gerrit.wikimedia.org/r/#/c/313056/ T142158
  • 10:43 hashar: Updating slave scripts for "Disable garbage collection for mw-phpunit.sh" https://gerrit.wikimedia.org/r/313051 T142158
  • 08:31 hashar: Reloading Zuul to deploy dc2ada37
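
The 20:08 entry above populates /etc/ssh/ssh_known_hosts with a one-line loop. The same idea can be sketched as a small function with quoting and an empty-result check. This is only an illustration under assumptions: the function name `scan_group_hosts` and the `KEYSCAN` / `RESOLVE` override variables are hypothetical and do not appear in the log; they exist so the loop can be tried against a scratch file without touching real hosts.

```shell
#!/bin/sh
# Hypothetical override hooks (not in the original log entry). By default
# they expand to the same tools the logged one-liner used.
KEYSCAN="${KEYSCAN:-ssh-keyscan}"
RESOLVE="${RESOLVE:-dig +short}"

# scan_group_hosts GROUP_FILE KNOWN_HOSTS_FILE
# For every line of GROUP_FILE matching "deployment", resolve the name and
# append the scanned host keys to KNOWN_HOSTS_FILE.
scan_group_hosts() {
    group_file="$1"
    known_hosts="$2"
    grep deployment "$group_file" | while read -r instance; do
        # Word splitting of $RESOLVE is intentional ("dig +short").
        ip="$($RESOLVE "$instance")"
        # Skip names that do not resolve instead of scanning an empty argument.
        [ -n "$ip" ] || continue
        # $ip is left unquoted on purpose: the resolver may return several
        # addresses, and the scanner accepts more than one host.
        $KEYSCAN $ip >> "$known_hosts"
    done
}
```

To reproduce the logged behaviour one would call `scan_group_hosts /etc/dsh/group/mediawiki-installation /etc/ssh/ssh_known_hosts` as root. As in the original loop, the keys are recorded against whatever the resolver prints, not against the instance names.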

2016-09-27

2016-09-26

  • 23:58 bd808: Started udp2log-mw on deployment-fluorine02 for T146723
  • 11:35 hashar: deployment-salt02 : autoremoving a bunch of java related packages
  • 11:31 hashar: rebooting deployment-salt02 has a kernel soft lock while hitting the disk
  • 11:24 hashar: beta: mass upgrading all debian packages on all instances
  • 10:32 hashar: beta: on deployment-pdf01 rm -fR /home/cscott/tmp/npm*
  • 10:29 hashar: deployment-pdf01 apt-get upgrade / cleaning files left over etc
  • 10:28 hashar: beta: on deployment-pdf01 rm -fR /home/cscott/.npm/ T145343

2016-09-24

  • 20:08 hashar: deployment-tin is shutdown. Replaced by Jessie deployment-tin02
  • 20:02 hashar: deployment-mira: ssh-keyscan deployment-tin02.deployment-prep.eqiad.wmflabs >> /etc/ssh/ssh_known_hosts
  • 20:00 hashar: beta: dropping deployment-tin (ubuntu) replaced by deployment-tin02 (jessie). Primary is still deployment-mira (https://gerrit.wikimedia.org/r/#/c/312654/ T144578 )

2016-09-23

2016-09-22

  • 19:29 hasharAway: switching Jenkins slaves workspace from /mnt/jenkins-workspace to /srv/jenkins-workspace (actually the same dir/inode on the filesystem)
  • 01:52 legoktm: deploying https://gerrit.wikimedia.org/r/312158

2016-09-21

  • 18:22 yuvipanda: shutting down integration-puppetmaster
  • 17:26 yuvipanda: cherry-pick https://gerrit.wikimedia.org/r/#/c/312044/ on deployment-puppetmaster
  • 16:41 hashar: deployment-tin02 initial provisioning is complete. Gotta add it as a deployment server via a puppet.git patch
  • 16:01 hashar: deployment-tin02 applied puppet classes beta::autoupdater, beta::deployaccess, role::deployment::server, role::labs::lvm::srv
  • 15:32 hashar: spawned deployment-tin02
  • 14:55 hashar: removed the CI puppet class from deployment-sca01 and deployment-sca02 . Stopped services using /srv , unmounted /srv, removed it from /etc/fstab
  • 14:27 hashar: deployment-sca01 and deployment-sca02 are now broken. The CI puppet class mounts /srv which ends up being only 500 MBytes
  • 14:08 hashar: deployment-mira adding puppet class beta::autoupdater
  • 14:06 hashar: Enabling Jenkins slave deployment-mira
  • 14:05 hashar: deployment-mira seems ready for action and is the primary deployment server. Enabling jenkins to it
  • 11:25 hashar: removing Jenkins slave deployment-tin , deployment-mira is the new deployment master T144578
  • 10:58 hashar: Changing Jenkins slaves home dir for deployment-sca01 and deployment-sca02 from /mnt/home/jenkins-deploy to /srv/jenkins/home/jenkins-deploy
  • 10:57 hashar: Changing Jenkins slaves home dir for deployment-tin and deployment-mira from /mnt/home/jenkins-deploy to /srv/jenkins/home/jenkins-deploy
  • 10:10 hashar: deployment-mira removing "role::labs::lvm::srv" duplicate with role::ci::slave::labs::common
  • 10:07 hashar: Making deployment-mira a Jenkins slave by applying puppet class role::ci::slave::labs::common T144578
  • 10:05 hashar: Arming keyholder on deployment-mira
  • 09:43 hashar: beta: switching master deployment server from deployment-tin to deployment-mira
  • 09:34 hashar: From Hiera:deployment-prep remove bit already in puppet: "scap::deployment_server": deployment-tin.deployment-prep.eqiad.wmflabs
  • 08:55 moritzm: remove mira from deployment-prep (replaced by deployment-mira)
  • 08:37 hashar: beta: manually rebased puppetmaster
  • 08:11 elukey: terminated jobrunner01 and removed from deployment-prep's scap dsh list
  • 07:19 legoktm: deploying https://gerrit.wikimedia.org/r/311927

2016-09-20

  • 21:49 hashar: Deleting deployment-mira02 /srv was too small. Replaced by deployment-mira
  • 20:54 hashar: from deployment-tin for T144578, accept ssh host key of deployment-mira : sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mira.deployment-prep.eqiad.wmflabs
  • 20:47 hashar: Creating deployment-mira instance with flavor c8.m8.s60 (8 cpu, 8G RAM and 60G disk) T144578
  • 19:00 thcipriani: cherry-picked https://gerrit.wikimedia.org/r/#/c/311760/ to deployment-puppetmaster to fix failing beta-scap-eqiad job, had to manually start rsync, puppet failed to start
  • 18:38 hashar: on tin: `sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mira02.deployment-prep.eqiad.wmflabs` - T144006
  • 18:33 hashar: on deployment-mira02 ran `sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki04.deployment-prep.eqiad.wmflabs` per T144006
  • 18:01 marxarelli: deployed mediawiki-config changes on beta cluster. back in read/write mode using new database instances
  • 17:37 marxarelli: deployment-db04 restored from backup and replication started
  • 16:54 marxarelli: upgraded package and data to mariadb 10 on deployment-db03
  • 16:31 marxarelli: cherry picking operations/puppet patches (T138778) to deployment-puppetmaster
  • 16:30 moritzm: rebooting deployment-mira02
  • 16:23 marxarelli: applied innodb transaction logs to deployment-db1 backup and successfully restored on deployment-db03
  • 15:47 marxarelli: completed innobackupex on deployment-db1. copying backup to deployment-db03 for restoration
  • 14:54 hashar: beta: cherry picking fix up for the jobrunner logging https://gerrit.wikimedia.org/r/#/c/311702/ and https://gerrit.wikimedia.org/r/311719 T146040
  • 14:44 marxarelli: entering read-only mode on beta cluster
  • 14:27 elukey: stopped puppet, jobrunner and jobchron on deployment-jobrunner01
  • 14:20 marxarelli: disabling beta cluster jenkins jobs in preparation for data migration (T138778)
  • 13:07 godog: add deployment-prometheus01 instance T53497
  • 11:20 elukey: applied beta::deployaccess, role::labs::lvm::srv, role::mediawiki::jobrunner to jobrunner02
  • 10:45 elukey: created deployment-jobrunner02 in deployment-prep

2016-09-19

  • 22:01 legoktm: shutdown integration-puppetmaster
  • 21:29 yuvipanda: regenerated client certs only on integration-puppetmaster01, seems ok now
  • 20:46 yuvipanda: re-enable puppet everywhere
  • 20:43 yuvipanda: enable puppet and run on integration-slave-trusty-1003.eqiad.wmflabs
  • 20:41 yuvipanda: accidentally deleted /var/lib/puppet/ssl on integration-puppetmaster01 as well, causing it to lose keys. Reprovision by pointing to labs puppetmaster
  • 20:34 yuvipanda: rm -rf /var/lib/puppet/ssl on all integration nodes
  • 20:34 yuvipanda: copied /etc/puppet/puppet.conf from integration-slave-trusty-1001 to all integration
  • 20:25 yuvipanda: delete /etc/puppet/puppet.conf.d/10-self.conf and /var/lib/puppet/ssl on integration-slave-trusty-1001
  • 20:20 yuvipanda: re-enabled puppet on integration-slave-trusty-1001
  • 20:08 yuvipanda: reset puppetmaster of integration-puppetmaster01 to be labs puppetmaster
  • 20:03 yuvipanda: disable puppet across integration project, moving puppetmasters
  • 19:49 legoktm: creating T144951 enabled role::puppetmaster::standalone role on integration-puppetmaster01
  • 19:33 legoktm: creating T144951 integration-puppetmaster01 instance using m1.small and debian jessie
  • 15:11 hashar: beta: updating jobrunner service 0dc341f..a0e8216

2016-09-17

2016-09-16

  • 21:03 hashar: deployment-tin did a git gc on /srv/deployment/ores That freed up disk space and cleared an alarm on co master mira02
  • 21:00 hashar: deleted deployment-parsoid05
  • 20:52 hashar: fixed puppet on deployment-parsoid05 . Temporary instance will delete it later to clear out shinken.wmflabs.org
  • 20:27 hashar: beta: force running puppet in batches of 4 instances: salt --batch 4 -v 'deployment-*' cmd.run 'puppet agent -tv'
  • 20:13 hashar: beta: restarted puppetmaster
  • 20:07 hashar: beta: salt -v '*' cmd.run 'rm -fR /var/lib/puppet/client/ssl/'
  • 20:07 hashar: beta: stopping puppetmaster, rm -f /var/lib/puppet/server/ssl/ca/signed/*
  • 19:53 hashar: beta created instance "deployment-parsoid05" Should be deleted later, that is merely to purge the hostname from Shinken ( http://shinken.wmflabs.org/host/deployment-parsoid05 )
  • 11:42 hashar: beta: apt-get upgrade on deployment-jobrunner01
  • 11:36 hashar: apt-get upgrade on deployment-tin , bring in a new hhvm version and others
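
Read bottom-up, the 20:07 to 20:27 entries above describe a full reset of the beta puppet certificates. A minimal sketch of that sequence in chronological order follows. Assumptions are labelled: the `run_step` wrapper, the `DRY_RUN` variable and the function name are hypothetical, and the exact `service puppetmaster stop` / `start` invocations are guesses, since the log only says the puppetmaster was stopped and restarted. The `rm` and `salt` commands are copied from the log.

```shell
#!/bin/sh
# DRY_RUN=1 (the default in this sketch) prints each step instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# run_step COMMAND_STRING: print the command in dry-run mode, else execute it.
run_step() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$1"
    else
        sh -c "$1"
    fi
}

reset_beta_puppet_ssl() {
    # 1. Stop the puppetmaster (service invocation is an assumption).
    run_step "service puppetmaster stop" || return 1
    # 2. Drop every certificate the master has signed (from the 20:07 entry).
    run_step "rm -f /var/lib/puppet/server/ssl/ca/signed/*" || return 1
    # 3. Drop the client-side SSL state on all instances (from the 20:07 entry).
    run_step "salt -v '*' cmd.run 'rm -fR /var/lib/puppet/client/ssl/'" || return 1
    # 4. Bring the puppetmaster back (service invocation is an assumption).
    run_step "service puppetmaster start" || return 1
    # 5. Re-run puppet in batches of 4 so agents request fresh certificates
    #    (from the 20:27 entry).
    run_step "salt --batch 4 -v 'deployment-*' cmd.run 'puppet agent -tv'" || return 1
}
```

Calling `reset_beta_puppet_ssl` with the default settings only lists the five steps, which makes it easy to review the order before running anything with `DRY_RUN=0`.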

2016-09-15

  • 22:29 legoktm: sudo salt '*precise*' cmd.run 'service mysql start', all mysql's are down
  • 16:45 godog: install xenial kernel on deployment-zotero01 and reboot T145793
  • 16:18 hashar: prometheus enabled on all beta cluster instances. Does not support Precise hence puppet will fail on the last two Precise instances deployment-db1 and deployment-db2 until they are migrated to Jessie T138778
  • 15:53 godog: add role::prometheus::node_exporter to classes in hiera:deployment-prep T144502
  • 15:10 hashar: beta: Applying puppet class role::prometheus::node_exporter to mira02 just like mira. That is for godog
  • 15:08 hashar: T144006 Disabled Jenkins job beta-scap-eqiad. On mira02 rm -fR /srv/* . Applying puppet for role::labs::lvm::srv
  • 15:05 hashar: T144006 Applying class role::labs::lvm::srv to mira02 (it is out of disk space :D )
  • 14:45 hashar: T144006 sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@mira02.deployment-prep.eqiad.wmflabs
  • 14:44 hashar: T144006 sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki05.deployment-prep.eqiad.wmflabs
  • 12:33 elukey: added base::firewall, beta::deployaccess, mediawiki::conftool, role::mediawiki::appserver to mediawiki05
  • 12:20 elukey: terminate mediawiki02 to create mediawiki05
  • 10:48 hashar: beta: cherry picking moritzm patch https://gerrit.wikimedia.org/r/#/c/310793/ "Also handle systemd in keyholder script" T144578
  • 09:33 hashar: T144006 sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki06.deployment-prep.eqiad.wmflabs
  • 09:10 elukey: executed git pull and then git rebase -i on deployment puppet master
  • 08:52 elukey: terminated mediawiki03 and created mediawiki06
  • 08:45 elukey: removed mediawiki03 from puppet with https://gerrit.wikimedia.org/r/#/c/310749/
  • 02:36 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/310701
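
Several T144006 entries above repeat the same command to accept a new instance's SSH host key as the jenkins-deploy user through the keyholder proxy. A small helper along these lines could avoid retyping it. The function name `accept_host_key`, the `DOMAIN` variable and the `DRY_RUN` switch are hypothetical; the sudo/ssh invocation itself is the one recorded in the log.

```shell
#!/bin/sh
# DRY_RUN=1 (the default in this sketch) prints the command instead of running it.
DRY_RUN="${DRY_RUN:-1}"
# Domain suffix taken from the hostnames in the log entries.
DOMAIN="deployment-prep.eqiad.wmflabs"

# accept_host_key SHORT_HOSTNAME
# Opens an ssh session as mwdeploy via the keyholder agent socket so the
# host key prompt can be answered interactively.
accept_host_key() {
    target="mwdeploy@$1.$DOMAIN"
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh $target"
    else
        sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh "$target"
    fi
}
```

For example, `accept_host_key deployment-mediawiki05` prints the command from the 14:44 entry; with `DRY_RUN=0` it runs it and still requires a human to confirm the host key.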

2016-09-14

  • 21:37 hashar: integration: setting "ulimit -c 2097152" on all slaves due to Zend PHP segfaulting T142158
  • 14:31 hashar: Added otto to integration labs project
  • 13:28 gehel: upgrading deployment-logstash2 to elasticsearch 2.3.5 - T145404
  • 09:27 hashar: Deleting deployment-mediawiki01 , replaced by deployment-mediawiki04 T144006
  • 07:19 legoktm: sudo salt '*trusty*' cmd.run 'service mysql start', it was down on all trusty slaves
  • 07:17 legoktm: mysql just died on a bunch of slaves (trusty-1013, 1012, 1001)

2016-09-13

  • 17:02 marxarelli: re-enabling beta cluster jenkins jobs following maintenance window
  • 16:59 marxarelli: aborting beta cluster db migration due to time constraints and ops outage. will reschedule
  • 15:34 marxarelli: disabled beta jenkins builds while in maintenance mode
  • 15:18 marxarelli: starting 2-hour read-only maintenance window for beta cluster migration
  • 10:06 hashar: beta: manually updated jobrunner install on deployment-jobrunner01 and deployment-tmh01 then reloaded the services with: service jobchron reload
  • 10:02 hashar: Trebuchet is broken for /srv/deployment/jobrunner/jobrunner, can't reach the deploy minions somehow. Did the update manually
  • 10:00 hashar: Upgrading beta cluster jobrunner to catch up with upstream b952a7c..0dc341f merely picking up a trivial log change ( https://gerrit.wikimedia.org/r/#/c/297935/ )
  • 09:40 hashar: Unpooled deployment-mediawiki01 from scap and varnish. Shutting down instance. T144006
  • 09:02 hashar: on deployment-tin, accepted mediawiki04 host key for jenkins-deploy user : sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki04.deployment-prep.eqiad.wmflabs T144006
  • 08:26 hashar: mwdeploy@deployment-mediawiki04 manually accepted ssh host key of deployment-tin T144006
  • 08:17 hashar: beta: manually accepted ssh host key for deployment-mediawiki04 as user mwdeploy on deployment-tin and mira T144006
  • 07:46 gehel: upgrading elasticsearch to 2.3.5 on deployment-elastic0? - T145404

2016-09-12

  • 14:41 elukey: applied base::firewall, beta::deployaccess, mediawiki::conftool, role::mediawiki::appserver to deployment-mediawiki04.deployment-prep.eqiad.wmflabs (Debian jessie instance) - T144006
  • 12:50 gehel: rolling back upgrading elasticsearch to 2.4.0 on deployment-elastic05 - T145058
  • 12:03 gehel: upgrading elasticsearch to 2.4.0 on deployment-elastic0? - T145058
  • 12:01 hashar: Gerrit: made analytics-wmde group to be owned by themselves
  • 11:57 hashar: Gerrit: added ldap/wmde as an included group of the 'wikidata' group. Asked by and demoed to addshore

2016-09-11

2016-09-09

  • 20:53 thcipriani: testing scap 3.2.5-1 on beta cluster
  • 11:08 hashar: Added git tag for latest versions of mediawiki/selenium and mediawiki/ruby/api
  • 09:30 legoktm: Image ci-jessie-wikimedia-1473412532 in wmflabs-eqiad is ready
  • 08:53 legoktm: added phpflavor-php70 label to integration-slave-jessie-100[1-5]
  • 08:49 legoktm: deploying https://gerrit.wikimedia.org/r/309048

2016-09-08

  • 21:33 hashar: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/309413 " Inject PHP_BIN=php5 for php53 jobs"
  • 20:00 hashar: nova delete ci-jessie-wikimedia-369422 (was stuck in deleting state)
  • 19:49 hashar: Nodepool, deleting instances that Nodepool lost track of (from nodepool alien-list)
  • 19:47 hashar: nodepool can't delete ci-jessie-wikimedia-369422 [ delete | 2.24 hours ]. Stuck in task_state=deleting :(
  • 19:46 hashar: Nodepool looping over some tasks since 17:45 ( https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=21&fullscreen )
  • 19:26 legoktm: repooled integration-slave-jessie-1005 now that php7 testing is done
  • 19:19 hashar: integration: salt -v '*' cmd.run 'cd /srv/deployment/integration/slave-scripts; git pull' | https://gerrit.wikimedia.org/r/308931
  • 19:12 hashar: integration: salt -v '*' cmd.run 'cd /srv/deployment/integration/slave-scripts; git pull' | https://gerrit.wikimedia.org/r/309272
  • 17:08 legoktm: deleted integration-jessie-lego-test01
  • 16:50 legoktm: deleted integration-aptly01
  • 10:03 hashar: Delete Jenkins job https://integration.wikimedia.org/ci/job/mwext-VisualEditor-sync-gerrit/ that has been left behind. It is no longer needed. T51846 T86659
  • 10:02 hashar: Delete mwext-VisualEditor-sync-gerrit job, already got removed by ostriches in 139d17c8f1c4bcf2bb761e13a6501e4d85684066 . The issue in Gerrit (T51846) has been fixed. Poke T86659 , one less job on slaves.

2016-09-07

  • 20:44 matt_flaschen: Re-enabled beta-code-update-eqiad .
  • 20:35 hashar: Updated security group for deployment-prep labs project. Allow ssh port 22 from contint1001.wikimedia.org (matching rules for gallium). T137323
  • 20:30 hashar: Updated security group for contintcloud and integration labs project. Allow ssh port 22 from contint1001.wikimedia.org (matching rules for gallium). T137323
  • 20:14 matt_flaschen: Temporarily disabled https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/ to test live revert of aa0f6ea
  • 16:09 hashar: Nodepool back in action. Had to manually delete some instances in labs
  • 15:58 hashar: Restarting Nodepool . Lost state when labnet got moved T144945
  • 13:13 hashar: Image ci-jessie-wikimedia-1473253681 in wmflabs-eqiad is ready , has php7 packages. T144872
  • 11:53 hashar: Force refreshing Nodepool jessie snapshot to get PHP7 included T144872
  • 11:03 hashar: integration: cherry pick https://gerrit.wikimedia.org/r/#/c/308955/ "contint: prefer our bin/php alternative" T144872
  • 10:55 hashar: integration: dropped PHP7 cherry pick from puppet master. https://gerrit.wikimedia.org/r/#/c/308918/ has been merged. Pushing it to the fleet of permanent Jessie slaves. T144872
  • 10:37 hashar: beta: cleaning up salt-keys on deployment-salt02 . Bunch of instances got deleted
  • 09:41 hashar: Moving rake jobs back to Nodepool ( T143938 ) with https://gerrit.wikimedia.org/r/#/c/306723/ and https://gerrit.wikimedia.org/r/#/c/306724/
  • 05:57 legoktm: deploying https://gerrit.wikimedia.org/r/308932 https://gerrit.wikimedia.org/r/299697
  • 05:26 legoktm: cherry-picked https://gerrit.wikimedia.org/r/#/c/308918/ onto integration-puppetmaster with a hack that has it only apply to integration-slave-jessie-1005
  • 04:59 legoktm: added Krenair to integration project to help debug puppet stuff
  • 04:35 legoktm: depooled integration-slave-jessie-1005 in jenkins so I can test puppet stuff on it

2016-09-06

  • 13:58 hashar: Qunit jobs should be all fine again now. T144802
  • 13:46 hashar: nodepool.SnapshotImageUpdater: Image ci-jessie-wikimedia-1473169259 in wmflabs-eqiad is ready T144802
  • 13:20 hashar: Rebuilding Nodepool Jessie image to hopefully include libapache-mod-php5 and restore qunit jobs behavior T144802
  • 10:37 hashar: gerrit: mark apps/android/commons hidden since it is now community maintained on GitHub. Will avoid confusion. T127678
  • 09:11 hashar: nodepool.SnapshotImageUpdater: Image ci-trusty-wikimedia-1473152801 in wmflabs-eqiad is ready
  • 09:06 hashar: nodepool.SnapshotImageUpdater: Image ci-jessie-wikimedia-1473152393 in wmflabs-eqiad is ready
  • 09:00 hashar: Trying to refresh Nodepool Jessie image . Image properties have been dropped, should fix it

2016-09-05

  • 14:08 hashar: Refreshing Nodepool base images for Trusty and Jessie. Managed to build new ones after T143769

2016-09-02

2016-09-01

2016-08-31

  • 23:40 bd808: forced puppet run on deployment-salt02. Had not run automatically for 8 hours
  • 23:36 bd808: Deleted /data/scratch on integration-slave-trusty-1016 to fix puppet
  • 23:32 bd808: Deleted /data/scratch on integration-slave-trusty-1013 to fix puppet
  • 23:22 bd808: Deleted /data/scratch on integration-slave-trusty-1012 to fix puppet
  • 23:19 bd808: Deleted /data/scratch on integration-slave-trusty-1011 to fix puppet
  • 23:15 bd808: Deleted /data/scratch on integration-slave-precise-1012 to fix puppet
  • 23:11 bd808: Deleted /data on integration-slave-precise-1011 to fix puppet
  • 23:08 bd808: Deleted /data on integration-slave-jessie-1001 to fix puppet
  • 23:04 bd808: Deleted empty /data, /data/project, and /data/scratch on integration-puppetmaster to fix puppet
  • 22:59 bd808: Deleted empty /data, /data/project, and /data/scratch on integration-publisher to fix puppet
  • 01:44 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/307670

2016-08-30

  • 23:31 yuvipanda: cherry-picking https://gerrit.wikimedia.org/r/#/c/307656/ fixed puppet on the elasticsearch machines!
  • 22:29 yuvipanda: in lieu of blood sacrifice, restart puppetmaster on deployment-puppetmaster
  • 21:44 yuvipanda: use clush to fix puppet.conf of all clients, realize also accidentally set a client's puppet.conf for the server, recover server's old conf file from a cat in shell history, restore, breathe sigh of relief
  • 21:37 yuvipanda: sudo takes like 15s each time, is there no god?
  • 21:36 yuvipanda: managed to get vim into a state where I can not quit it, probably recording a macro. I hate computers
  • 21:16 yuvipanda: deployment-pdf01 fixed manually
  • 21:15 yuvipanda: deployment-pdf02 has proper ssl certs mysteriously without me doing anything
  • 21:06 yuvipanda: moved deployment-db[12], deployment-stream to not use role::puppet::self, attempting to semi-automate rest
  • 20:52 yuvipanda: cherry-picked appropriate patch on deployment-puppetmaster for T120159, did https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep/host/deployment-puppetmaster&oldid=818847 to make sure the puppetmaster allows connections from elsewhere
  • 19:48 legoktm: deploying https://gerrit.wikimedia.org/r/306710
  • 19:13 bd808: Fixed puppet runs on deployment-sca0[12] with cherry-pick of https://gerrit.wikimedia.org/r/#/c/307561
  • 18:57 bd808: Duplicate declaration: File[/srv/deployment] is already declared in file /etc/puppet/modules/contint/manifests/deployment_dir.pp:14; cannot redeclare at /etc/puppet/modules/service/manifests/deploy/common.pp:12 on node deployment-sca01.deployment-prep.eqiad.wmflabs
  • 18:40 bd808: Puppet busted on deployment-aqs01 -- Could not find data item analytics_hadoop_hosts in any Hiera data file and no default supplied at /etc/puppet/manifests/role/aqs.pp:46
  • 12:59 hashar: beta: reverted master branch to origin. Ran scap and re-enabled the beta-code-update-eqiad job.
  • 12:55 hashar: Running scap on beta cluster via https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/117786/console T143889
  • 12:53 hashar: Cherry picking https://gerrit.wikimedia.org/r/#/c/307501/ on beta cluster for T143889
  • 12:51 hashar: disabling https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/ to cherry pick a revert patch

2016-08-29

  • 07:56 hashar: hard rebooting integration-slave-trusty-1012 via horizon and restarting puppet manually
  • 07:50 hashar: integration-slave-trusty-1013 puppet.conf certname was set to 'undef' breaking puppet

2016-08-27

  • 20:51 hashar: integration: tweak sudo policy for jenkins-deploy running cowbuilder: env_keep+=DEB_BUILD_OPTIONS
  • 20:24 hashar: Manually installing jenkins-debian-glue 0.17.0 on integration-slave-jessie-1004 and integration-slave-jessie-1005 ( T142891 ) . That is to support PBUILDER_USENETWORK T141114
  • 20:05 hashar: Jenkins added global env variable BUILD_TIMEOUT set to 30 for T144094

2016-08-26

  • 22:29 legoktm: deploying https://gerrit.wikimedia.org/r/307025
  • 08:15 Amir1: restart uwsgi-ores and celery-ores-worker in deployment-sca03 (T143567)
  • 08:11 hashar: beta-scap-eqiad job is back in operation. Was blocked on logstash not being reachable. T143982
  • 08:10 hashar: deployment-logstash2 is back after a hard reboot. T143982
  • 08:07 hashar: rebooting deployment-logstash2 via Horizon. Kernel hang apparently T143982
  • 08:00 hashar: beta-scap-eqiad failing, investigating
  • 07:54 Amir1: cherry-picked 306839/1 into deployment-puppetmaster
  • 00:28 twentyafterfour: restarted puppetmaster service on deployment-puppetmaster

2016-08-25

  • 23:15 Amir1: cherry-picked 306839/1 into puppetmaster
  • 20:10 hashar: Delete integration-slave-trusty-1023 with label AndroidEmulator. The Android job has been migrated to a new Jessie based instance via T138506
  • 19:05 hashar: hard rebooting integration-raita via Horizon
  • 16:04 hashar: fixing puppet.conf on integration-slave-trusty-1013; it mysteriously considered itself to be the puppetmaster
  • 16:02 hashar: integration restarted puppetmaster service
  • 08:28 hashar: beta update database fixed
  • 08:28 hashar: beta cluster update database failed due to: "Your composer.lock file is up to date with current dependencies!" Probably a race condition with ongoing scap.

2016-08-24

  • 15:14 halfak: deploying ores d00171
  • 09:50 hashar: deployment-redis02: fixed AOF file /srv/redis/deployment-redis02-6379.aof and restarted the redis instance. Should fix T143655 and might help T142600
  • 09:43 hashar: T143655 stopping redis 6379 on deployment-redis02 : initctl stop redis-instance-tcp_6379
  • 09:38 hashar: deployment-redis02 initctl stop redis-instance-tcp_6379 && initctl start redis-instance-tcp_6379 | That did not fix it magically though T143655

2016-08-23

2016-08-22

  • 23:40 legoktm: updating slave_scripts on all slaves

2016-08-18

  • 22:03 bd808: deployment-fluorine02: Hack 'datasets:x:10003:997::/home/datasets:/bin/bash' into /etc/passwd for T117028
  • 20:30 MaxSem: Restarted hhvm on appservers for wikidiff2 upgrades
  • 19:03 MaxSem: Upgrading hhvm-wikidiff2 in beta cluster
  • 16:53 legoktm: deploying https://gerrit.wikimedia.org/r/#/c/305532/

2016-08-17

  • 22:28 legoktm: deploying https://gerrit.wikimedia.org/r/305408
  • 21:33 cscott: updated OCG to version e3e0fd015ad8fdbf9da1838c830fe4b075c59a29
  • 21:28 bd808: restarted salt-minion on deployment-pdf02
  • 21:26 bd808: restarted salt-minion on deployment-pdf01
  • 21:15 cscott: starting OCG deploy to beta
  • 14:10 gehel: upgrading elasticsearch to 2.3.4 on deployment-logstash2.deployment-prep.eqiad.wmflabs
  • 13:28 gehel: upgrading elasticsearch to 2.3.4 on deployment-elastic*.deployment-prep + JVM upgrade

2016-08-16

  • 23:10 thcipriani: max_servers at 6, seeing 6 allocated instances, still seeing 403 already used 10 of 10 instances :((
  • 22:37 thcipriani: restarting nodepool, bumping max_servers to match up with what openstack seems willing to allocate (6)
  • 09:06 Amir1: removing ores-related-cherry-picked commits from deployment-puppetmaster

2016-08-15

  • 21:30 thcipriani: update scap on beta to 3.2.3-1 bugfix release
  • 02:30 bd808: Forced a zuul restart -- https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Restart
  • 02:23 bd808: Lots and lots of "AttributeError: 'NoneType' object has no attribute 'name'" errors in /var/log/zuul/zuul.log
  • 02:21 bd808: nodepool delete 301068
  • 02:20 bd808: nodepool delete 301291
  • 02:20 bd808: nodepool delete 301282
  • 02:19 bd808: nodepool delete 301144
  • 02:11 bd808: nodepool delete 299641
  • 02:11 bd808: nodepool delete 278848
  • 02:08 bd808: Aug 15 02:07:48 labnodepool1001 nodepoold[24796]: Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)

2016-08-13

2016-08-12

2016-08-10

2016-08-09

2016-08-08

  • 23:33 Tim: deleted instance deployment-depurate01
  • 16:19 bd808: Manually cleaned up root@logstash02 cronjobs related to logstash03
  • 14:39 Amir1: deploying d00159c for ores in sca03
  • 10:14 Amir1: deploying 616707c into sca03 (for ores)

2016-08-07

  • 12:01 hashar: Nodepool: can't spawn instances due to: Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)
  • 12:01 hashar: nodepool: deleted servers stuck in "used" states for roughly 4 hours (using: nodepool list , then nodepool delete <id>)
  • 11:54 hashar: Nodepool: can't spawn instances due to: Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)
  • 11:54 hashar: nodepool: deleted servers stuck in "used" states for roughly 4 hours (using: nodepool list , then nodepool delete <id>)
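
The cleanup logged above (list the nodes, then delete the ones stuck in "used") can be sketched as a small filter. This is only an illustrative sketch, not the procedure that was actually run: the `Node` record, its field names, and the sample values are hypothetical stand-ins for what `nodepool list` reports, and the four-hour threshold is taken from the log entry. The script prints the delete commands rather than executing anything.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one row of `nodepool list` output."""
    node_id: int
    state: str
    age_hours: float


def stuck_nodes(nodes, state="used", min_age_hours=4.0):
    """Return ids of nodes that have sat in `state` for at least `min_age_hours`."""
    return [
        n.node_id
        for n in nodes
        if n.state == state and n.age_hours >= min_age_hours
    ]


# Made-up sample data, for illustration only.
sample = [
    Node(101, "used", 4.2),
    Node(102, "ready", 0.3),
    Node(103, "used", 0.1),
    Node(104, "delete", 5.0),
]

for node_id in stuck_nodes(sample):
    # Print the command an operator would run by hand; nothing is executed.
    print(f"nodepool delete {node_id}")
```

Keeping the selection separate from the deletion lets an operator review the list before freeing quota.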

2016-08-06

  • 12:31 Amir1: restarting uwsgi-ores and celery-ores-worker in deployment-sca03
  • 12:28 Amir1: cherry-picked 303356/1 into the puppetmaster
  • 12:00 Amir1: restarting uwsgi-ores and celery-ores-worker in deployment-sca03

2016-08-05

2016-08-04

  • 20:07 marxarelli: Running jenkins-jobs update config/ 'selenium-*' to deploy https://gerrit.wikimedia.org/r/#/c/302775/
  • 17:03 legoktm: jstart -N qamorebots /usr/lib/adminbot/adminlogbot.py --config ./confs/qa-logbot.py

2016-08-01

  • 20:28 thcipriani: restarting deployment-ms-be01, not responding to ssh, mw-fe01 requests timing out
  • 08:28 Amir1: deploying fedd675 to ores in sca03

2016-07-29

2016-07-28

  • 21:46 hashar_: integration: change sudo policy for jenkins-deploy to help on T141538 : env_keep+=WORKSPACE
  • 12:18 hashar: installed 2.1.0-391-gbc58ea3-wmf1jessie1 on zuul-dev-jessie.integration.eqiad.wmflabs T140894
  • 09:46 hashar: Nodepool: Image ci-trusty-wikimedia-1469698821 in wmflabs-eqiad is ready
  • 09:35 hashar: Regenerated Nodepool image for Trusty. The snapshot failed while upgrading grub-pc for some reason. Noticed with thcipriani yesterday

2016-07-27

  • 16:13 hashar: salt -v '*slave-trusty*' cmd.run 'service mysql start' ( was missing on integration-slave-trusty-1011.integration.eqiad.wmflabs )
  • 14:03 hashar: upgraded zuul on gallium via dpkg -i /root/zuul_2.1.0-391-gbc58ea3-wmf1precise1_amd64.deb (revert is zuul_2.1.0-151-g30a433b-wmf4precise1_amd64.deb )
  • 12:43 hashar: restarted Jenkins for some trivial plugins updates
  • 12:35 hashar: hard rebooting integration-slave-trusty-1011 from Horizon. ssh lost, no log in Horizon.
  • 09:46 hashar: manually triggered debian-glue on all operations/debs repo that had no jenkins-bot vote. Via zuul enqueue on gallium and list fetched from "gerrit query --current-patch-set 'is:open NOT label:verified=2,jenkins-bot project:^operations/debs/.*'|egrep '(ref|project):'"
  • 06:21 Tim: created instance deployment-depurate01 for testing of role::html5depurate

2016-07-26

  • 20:13 hashar: Zuul deployed https://gerrit.wikimedia.org/r/301093 which adds 'debian-glue' job on all of operations/debs/ repos
  • 18:10 ostriches: zuul: reloading to pick up config change
  • 12:49 godog: cherry-pick https://gerrit.wikimedia.org/r/#/c/300827/ on deployment-puppetmaster
  • 11:59 legoktm: also pulled in I73f01f87b06b995bdd855628006225879a17fee5
  • 11:59 legoktm: deploying https://gerrit.wikimedia.org/r/301109
  • 11:37 hashar: rebased integration puppetmaster git repo
  • 11:31 hashar: enable puppet agent on integration-puppetmaster . Had it disabled while hacking on https://gerrit.wikimedia.org/r/#/c/300830/
  • 08:42 hashar: T141269 On integration-slave-trusty-1018 , deleting workspace that has a corrupt git: rm -fR /mnt/jenkins-workspace/workspace/mediawiki-extensions-hhvm*
  • 01:08 Amir1: deployed ores a291da1 in sca03, ores-beta.wmflabs.org works as expected

2016-07-25

  • 22:45 legoktm: restarting zuul due to depends-on lockup
  • 14:24 godog: bounce puppetmaster on deployment-puppetmaster
  • 13:17 godog: cherry-pick https://gerrit.wikimedia.org/r/#/c/300827/ on deployment-puppetmaster

2016-07-23

  • 20:06 bd808: Cleanup jobrunner01 logs via -- sudo logrotate --force /etc/logrotate.d/mediawiki_jobrunner
  • 20:03 bd808: Deleted jobqueues in redis with no matching wikis: ptwikibooks, labswiki
  • 19:20 bd808: jobrunner01 spamming /var/log/mediawiki with attempts to process jobs for wiki=labswiki

2016-07-22

  • 20:26 hashar: T141114 upgraded jenkins-debian-glue from v0.13.0 to v0.17.0 on integration-slave-jessie-1001 and integration-slave-jessie-1002
  • 19:07 thcipriani: beta-cluster has successfully used a canary for mediawiki deployments
  • 16:53 thcipriani: bumping scap to v.3.2.1 on deployment-tin to test canary deploys, again
  • 16:46 thcipriani: rolling back scap version to v.3.2.0
  • 16:38 thcipriani: bumping scap to v.3.2.1 on deployment-tin to test canary deploys
  • 13:02 hashar: zuul rebased patch queue on tip of upstream branch and force pushed branch. c3d2810...4ddad4e HEAD -> patch-queue/debian/precise-wikimedia (forced update)
  • 10:32 hashar: Jenkins restarted and it pooled both integration-slave-jessie-1002 and integration-slave-trusty-1018
  • 10:23 hashar: Jenkins has some random deadlock. Will probably reboot it
  • 10:17 hashar: Jenkins can't ssh / add slaves integration-slave-jessie-1002 or integration-slave-trusty-1018 . Apparently due to some Jenkins deadlock in the ssh slave plugin :-/ Lame way to solve it: restart Jenkins
  • 10:10 hashar: rebooting integration-slave-jessie-1002 and integration-slave-trusty-1018 . Hung somehow
  • 10:06 hashar: T141083 salt -v '*slave-trusty*' cmd.run 'service mysql start'
  • 09:55 hashar: integration-slave-trusty-1001 service mysql start

2016-07-21

  • 16:11 hashar: Updated our JJB fork cherry picking f74501e781f by madhuvishy. Was made to support the maven release plugin. Branch bump is 10f2bcd..6fcaf39
  • 16:04 hashar: integration/zuul.git: updated upstream branch: bc58ea34125f11eb353abc3e5b96ac1efad06141, finally caught up with upstream \O/
  • 15:13 hashar: integration/zuul.git: updated upstream branch: 06770a85fcff810fc3e1673120710100fc7b0601:upstream
  • 14:03 hashar: integration/zuul.git bumping upstream branch: git push d34e0b4:upstream
  • 03:18 greg-g: had to do https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update twice, seems to be back
  • 00:13 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/299825/ to deployment-puppetmaster so wdqs nginx log parsing can be tested

2016-07-20

  • 13:55 hashar: beta: switching job beta-scap-eqiad to use 'scap sync' per https://gerrit.wikimedia.org/r/#/c/287951/ (poke thcipriani )
  • 12:47 hashar: integration: enabled unattended upgrade on all instances by adding contint::packages::apt to https://wikitech.wikimedia.org/wiki/Hiera:Integration
  • 10:28 hashar: beta dropped salt-key on deployment-salt02 for the three instances: deployment-upload.deployment-prep.eqiad.wmflabs , deployment-logstash3.deployment-prep.eqiad.wmflabs and deployment-ores-web.deployment-prep.eqiad.wmflabs
  • 10:26 hashar: beta: rebased puppetmaster git repo. "Parsoid: Move to service::node" has weird conflict https://gerrit.wikimedia.org/r/#/c/298436/
  • 10:15 hashar: beta: removing puppet cherry pick of https://gerrit.wikimedia.org/r/#/c/258979/ "mediawiki: add conftool-specifc credentials and scripts" abandoned/superseded and caused a conflict
  • 08:17 hashar: deployment-fluorine : deleting a puppet lock file /var/lib/puppet/state/agent_catalog_run.lock (created at 2016-07-18 19:58:46 UTC)
  • 01:53 legoktm: deploying https://gerrit.wikimedia.org/r/299930

2016-07-18

  • 20:56 thcipriani: Deleted deployment-fluorine:/srv/mw-log/archive/*-201605* freed 30 GB
  • 15:00 hashar: Upgraded Zuul on the Precise slaves to zuul_2.1.0-151-g30a433b-wmf4precise1
  • 12:10 hashar: (restarted qa-morebots)
  • 12:10 hashar: Enabling puppet again on integration-slave-precise-1002 , removing Zuul-server config and adding the slave back in Jenkins pool

2016-07-16

  • 23:19 paladox: testing morebots

2016-07-15

  • 08:34 hashar: Unpooling integration-slave-precise-1002 will use it as a zuul-server test instance temporarily

2016-07-14

  • 18:54 ebernhardson: deployment-prep manually edited elasticsearch.yml on deployment-elastic05 and restarted to get it listening on eth0. Still looking into why puppet wrote out wrong config file
  • 09:05 Amir1: rebooting deployment-ores-redis
  • 08:29 Amir1: deploying 0e9555f to ores-beta (sca03)

2016-07-13

  • 16:05 urandom: Installing Cassandra 2.2.6-wmf1 on deployment-restbase0[1-2].deployment-prep.eqiad.wmflabs : T126629
  • 13:58 hashar: T137525 reverted Zuul back to zuul_2.1.0-95-g66c8e52-wmf1precise1_amd64.deb . It could not connect to Gerrit reliably
  • 13:46 hashar: T137525 Stopped zuul that ran in a terminal (with -d). Started it with the init script.
  • 11:37 hashar: apt-get upgrade on deployment-mediawiki02
  • 08:33 hashar: removing deployment-parsoid05 from the Jenkins slaves T140218

2016-07-12

  • 20:29 hashar: integration: force running unattended upgrade on all instances: salt --batch 4 -v '*' cmd.run 'unattended-upgrade' . That upgrades diamond and hhvm among others. imagemagick-common has a prompt though
  • 20:22 hashar: CI force running puppet on all instances: salt --batch 5 -v '*' puppet.run
  • 20:04 hashar: Maybe fix unattended upgrade on the CI slaves via https://gerrit.wikimedia.org/r/298568
  • 16:43 Amir1: deploying f472f65 to ores-beta
  • 10:11 hashar: Github created repos operations-debs-contenttranslation-apertium-mk-en and operations-docker-images-toollabs-images for Gerrit replication

2016-07-11

  • 14:24 hashar: Removing ZeroMQ config from the Jenkins jobs. It is now enabled globally. T139923
  • 10:16 hashar: T136188: on Trusty slaves, upgrading Chromium from v49 to v51: salt -v '*slave-trusty-*' cmd.run 'apt-get -y install chromium-browser chromium-chromedriver chromium-codecs-ffmpeg-extra'
  • 10:13 hashar: T136188: salt -v '*slave-trusty*' cmd.run 'rm /etc/apt/preferences.d/chromium-*'
  • 10:09 hashar: Unpinning Chromium v49 from the Trusty slaves and upgrading to v51 for T136188
  • 09:34 zeljkof: Enabled ZMQ Event Publisher on all Jobs in Jenkins

2016-07-09

2016-07-08

2016-07-07

  • 21:41 MaxSem: Chowned php-master/vendor back to jenkins-deploy
  • 13:10 hashar: deleting integration-slave-trusty-1024 and integration-slave-trusty-1025 to free up some RAM. We have enough permanent Trusty slaves. T139535
  • 02:43 MaxSem: started redis-server on deployment-stream
  • 01:14 bd808: Restarted logstash on deployment-logstash2
  • 01:13 MaxSem: Leaving my hacks for the night to collect data, if needed revert with cd /srv/mediawiki-staging/php-master/vendor && sudo git reset --hard HEAD && sudo chown -hR jenkins-deploy:wikidev .
  • 00:50 bd808: Rebooting deployment-logstash3.eqiad.wmflabs; console full of hung process messages from kernel
  • 00:27 MaxSem: Initialized ORES on all wikis where it's enabled, was causing job failures
  • 00:13 MaxSem: Debugging a fatal in betalabs, might cause syncs to fail

2016-07-06

  • 20:30 hashar: beta: restarted mysql on both db1 and db2 so it takes into account the --syslog setting T119370
  • 20:08 hashar: beta: on db1 and db2 move the MariaDB 'syslog' setting under [mysqld_safe] section. Cherry picked https://gerrit.wikimedia.org/r/#/c/296713/3 and reloaded mysql on both instances. T119370
  • 14:54 hashar: Image ci-jessie-wikimedia-1467816381 in wmflabs-eqiad is ready T133779
  • 14:47 hashar_: attempting to refresh ci-jessie-wikimedia image to get librdkafka-dev included for T133779

2016-07-05

  • 21:54 hasharAway: CI has drained the gate-and-submit queue
  • 21:37 hasharAway: Nodepool: nodepool delete a few instances that would never spawn / have been stuck for ~ 40 minutes

2016-07-04

  • 18:58 hashar: Upgrading arcanist on permanent CI slaves since xhpast was broken T137770
  • 12:50 yuvipanda: migrating deployment-tin to labvirt1011

2016-07-03

  • 13:10 paladox: phabricator Update phab-01 and phab-05 (phab-02) and phab-03 to fix a security bug in phabricator (Did the update last night but forgot to log it)
  • 12:04 jzerebecki: reloading zuul for 7e6a2e2..13ea50f

2016-07-02

  • 13:38 jzerebecki: reloading zuul for 15127b2..7e6a2e2

2016-06-30

  • 10:31 hashar: Deleting integration-slave-trusty-1015 . Can not bring up mysql T138074 and the ssh slave connection would not hold anyway. Must be broken somehow
  • 10:04 hashar: Attempting to refresh Nodepool image for Jessie ( ci-jessie-wikimedia ). Been stale for 284 hours (12 days)
  • 09:36 hashar: Trusty is missing the package arcanist ... :(
  • 09:35 hashar: Attempting to refresh Nodepool image for Trusty ( ci-trusty-wikimedia ). Been stale for 283 hours (12 days)

2016-06-28

  • 21:33 halfak: deploying ores beec291
  • 21:15 halfak: deploying ores 6979a98

2016-06-27

  • 22:32 eberhardson: deployment-prep deployed gerrit.wikimedia.org/r/296279 to puppetmaster to test kibana4 role
  • 19:41 bd808: Rebooting deployment-logstash3.eqiad.wmflabs via wikitech. Console log full of blocked kworker messages, ssh non-responsive, and blocking logstash records from being recorded.
  • 18:20 thcipriani: deployment-puppetmaster.deployment-prep:/var/lib/git/labs/private modules/secret/secrets/keyholder keys conflicts resolved
  • 18:09 bd808: Git repo at deployment-puppetmaster.deployment-prep:/var/lib/git/labs/private is behind upstream due to multiple modules/secret/secrets/keyholder local files that would be overwritten by upstream changes.

2016-06-24

2016-06-23

  • 13:58 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/295691
  • 12:13 hashar: Deleting integration-saltmaster and recreating it with Jessie T136410
  • 10:14 hashar: T137807 Upgrading Jenkins TAP Plugin
  • 08:55 hashar: integration: rebased puppet master by dropping a conflicting/obsolete patch
  • 08:28 hashar: fixing puppet cert on deployment-cache-text04

2016-06-17

  • 10:35 jzerebecki: offlined integration-slave-trusty-1015 T138074
  • 10:06 hashar: Refreshed Nodepool Trusty image
  • 10:02 hashar: Refreshed Nodepool Jessie image

2016-06-14

  • 14:22 hashar: T136971 on tin MediaWiki 1.28.0-wmf.6, from 1.28.0-wmf.6, successfully checked out. Applying security patches
  • 11:21 hashar: T137797 Created Gerrit repository operations/debs/geckodriver to package https://github.com/mozilla/geckodriver

2016-06-13

  • 21:11 hashar: https://integration.wikimedia.org/ci/computer/integration-slave-trusty-1015/ put offline. Jenkins can't ssh / pool it for some reason
  • 20:07 hashar: beta: update.php / database update finally passes!
  • 19:55 hashar: T137615 deployment-db2, **eswiki** > CREATE INDEX echo_notification_event ON echo_notification (notification_event);
  • 19:22 hashar: T137615 deployment-db2, enwiki > CREATE INDEX echo_notification_event ON echo_notification (notification_event);
  • 10:37 hashar: Restarted puppetmaster on integration-puppetmaster (memory leak / can not fork: no memory)
  • 10:35 hashar: T137561 salt -v '*trusty*' cmd.run "cd /root/ && dpkg -i firefox_46.0.1+build1-0ubuntu0.14.04.3_amd64.deb"
  • 10:23 hashar: Hard reboot integration-slave-trusty-1015
  • 08:30 hashar: Beta: `mwscript extensions/Echo/maintenance/removeInvalidTargetPage.php --wiki=enwiki` for T137615

2016-06-10

2016-06-09

  • 18:49 hashar: restarting nutcracker on deployment-mediawiki02
  • 16:53 hashar: rebuild Nodepool trusty image ci-trusty-wikimedia-1465490962
  • 16:37 hashar: Manually deleting old zuul references on scandium.eqiad.wmnet . Running in a screen
  • 16:32 hashar: rebuild Nodepool jessie image ci-jessie-wikimedia-1465489579
  • 16:03 hashar: Restarting Nodepool

2016-06-08

  • 02:56 legoktm: / on gallium is read-only
  • 02:47 legoktm: disabling/enabling gearman in jenkins because everything is stuck

2016-06-07

  • 19:28 hashar: Nodepool has trouble spawning instances, probably due to ongoing (?) labs maintenance
  • 14:56 hashar: Restarting Jenkins to upgrade Rebuilder plugin with https://github.com/jenkinsci/rebuild-plugin/pull/34 (sort out parameters not being reinjected)
  • 09:02 hashar: Upgrading Jenkins IRC plugin 2.25..2.27 and instant messaging plugin 1.34..1.35 . The former should fix a deadlock when shutting down Jenkins | T96183

2016-06-06

  • 19:26 hasharAway: Regenerating Nodepool snapshots for Trusty and Jessie
  • 13:04 hashar: Migrated all qunit jobs to Nodepool T136301 has the related Gerrit changes
  • 10:05 hashar: migrating mediawiki-core-qunit job to Nodepool instances https://gerrit.wikimedia.org/r/#/c/291322/ T136301

2016-06-04

  • 00:09 Krinkle: krinkle@integration-slave-trusty-1017:~$ sudo rm -rf /mnt/jenkins-workspace/workspace/mediawiki-extensions-hhvm/src/extensions/Babel (T86730)

2016-06-03

  • 19:18 hashar: Image ci-jessie-wikimedia-1464981111 in wmflabs-eqiad is ready Zend 5.x for qunit | T136301
  • 15:17 hashar: refreshed Nodepool Trusty image due to some imagemagick upgrade issue. Image ci-trusty-wikimedia-1464966671 in wmflabs-eqiad is ready
  • 10:40 hashar: scandium (zuul merger): rm -fR /srv/ssd/zuul/git/mediawiki/extensions/Collection T136930

2016-06-02

  • 12:10 hashar: Upgraded Zuul: upstream code is 66c8e52..30a433b, package is 2.1.0-151-g30a433b-wmf1precise1

2016-06-01

  • 17:49 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/292186
  • 16:45 tgr: enabling AuthManager on beta cluster
  • 15:20 legoktm: deploying https://gerrit.wikimedia.org/r/292153
  • 14:44 twentyafterfour: jenkins restart completed
  • 14:36 twentyafterfour: restarting jenkins to install "single use slave" plugin (jenkins will restart when all builds are finished)
  • 13:49 hashar: Beta : clearing temporary files under /data/project/upload7 (mainly wikimedia/commons/temp )
  • 10:29 hashar: Upgraded Linux kernel on deployment-salt02 T136411
  • 10:14 hashar: beta: salt-key -d deployment-salt.deployment-prep.eqiad.wmflabs T136411
  • 09:16 hashar: Enabling puppet again on Trusty slaves. Chromium is now properly pinned to version 49 ( https://gerrit.wikimedia.org/r/#/c/291116/3 | T136188 )
  • 08:55 hashar: integration slaves : salt -v '*' pkg.upgrade

2016-05-31

  • 20:24 bd808: Reloading zuul to pick up I58f878f3fd19dfa21a46a52464575cb06aacbb22

2016-05-30

  • 18:39 hashar: Upgraded our Jenkins Job Builder fork to 1.5.0 + a couple of cherry picks: cd63874...10f2bcd
  • 12:53 hashar: Upgrading Zuul 1cc37f7..66c8e52 T128569
  • 08:04 ori: zuul is back up but jobs which were enqueued are gone
  • 07:50 ori: restarting jenkins on gallium, too
  • 07:49 ori: restarted zuul-merger service on gallium
  • 07:44 ori: Disconnecting and then reconnecting Gearman from Jenkins did not appear to do anything; going to depool / repool nodes.
  • 07:42 ori: Temporarily disconnecting Gearman from Jenkins, per <https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues>

2016-05-28

  • 04:43 ori: depooling integration-slave-trusty-1015 to profile phpunit runs

2016-05-27

  • 19:29 hasharAway: Refreshed Nodepool images
  • 18:13 thcipriani: restarting zuul for deadlock
  • 18:00 thcipriani: Reloading Zuul to deploy I0c3aeacf92d430ad1272f5f00e7fb7182b8a05bf
  • 02:55 bd808: Deleted deployment-fluorine:/srv/mw-log/archive/*-20160[34]* logs; freed 26G

2016-05-26

  • 22:23 hashar: salt -v '*trusty*' cmd.run 'puppet agent --disable "Chromium needs to be v49. See T136188"'
  • 21:47 hashar: integration-slave-trusty-1015 still on Chromium 50 .. T136188
  • 21:42 hashar: downgrading chromium-browser on integration-slave-1015 T136188
  • 09:24 jzerebecki: reloading zuul for d38ad0a..6798539
  • 07:48 gehel: deployment-prep upgrading elasticsearch to 2.3.3 and restarting (T133124)
  • 07:36 dcausse: deployment-prep elastic: updating cirrussearch warmers (T133124)
  • 07:31 gehel: deployment-prep deploying new elasticsearch plugins (T133124)

2016-05-25

  • 22:38 Amir1: running puppet agent manually on sca01
  • 16:26 hashar: 2016-05-25 16:24:35,491 INFO nodepool.image.build.wmflabs-eqiad.ci-trusty-wikimedia: Notice: /Stage[main]/Main/Package[ruby-jsduck]/ensure: ensure changed 'purged' to 'present' T109005
  • 15:07 hashar: g++ added to Jessie and Trusty Nodepool instances | T119143
  • 14:12 hashar: Regenerating Nodepool snapshot to include g++ which is required by some NodeJS native modules T119143
  • 10:58 hashar: Updating Nodepool ci-jessie-wikimedia snapshot image to get netpbm package installed into it. T126992 https://gerrit.wikimedia.org/r/290651
  • 09:30 hashar: Clearing git-sync-upstream script on integration-slave-trusty-1013 and integration-slave-trusty-1017. That is only supposed to be on the puppetmaster
  • 09:15 hashar: Fixed resolv.conf on integration-slave-trusty-1013 and force running puppet to catch up with change since May 16 19:52
  • 09:11 hashar: restarting puppetmaster on integration-puppetmaster ( memory leak / can not fork)

2016-05-24

  • 07:03 mobrovac: rebooting deployment-tin, can't log in

2016-05-23

  • 19:35 hashar: killed all mysqld process on Trusty CI slaves
  • 15:49 thcipriani: beta code update not running, disconnect-reconnect dance resulted in: [05/23/16 15:48:39] [SSH] Authentication failed.
  • 14:32 jzerebecki: offlined integration-slave-trusty-1004 because it can't connect to mysql T135997
  • 13:32 hashar: Upgrading Jenkins git plugins and restarting Jenkins
  • 11:01 hashar: Upgrading hhvm on Trusty slaves. Brings in hhvm compiled against libicu52 instead of libicu48
  • 09:12 _joe_: deployment-prep: all hhvm hosts in beta upgraded to run on the newer libicu; now running updateCollation.php (T86096)
  • 09:11 hashar: Image ci-jessie-wikimedia-1463994307 in wmflabs-eqiad is ready
  • 09:01 hashar: Image ci-trusty-wikimedia-1463993508 in wmflabs-eqiad is ready
  • 08:56 _joe_: deployment-prep: starting upgrade of HHVM to a version linked to libicu52, T86096
  • 08:54 hashar: Regenerating Nodepool image manually. Broke over the week-end due to a hhvm/libicu transition. Should get pip 8.1.x now

2016-05-20

2016-05-19

  • 16:47 thcipriani: deployment-tin jenkins worker seems to be back online after some prodding
  • 16:41 thcipriani: beta-code-update eqiad hung for past few hours
  • 15:16 hashar: Restarted zuul-merger daemons on both gallium and scandium : file descriptors leaked
  • 11:59 hashar: CI: salt -v '*' cmd.run 'pip install --upgrade pip==8.1.2'
  • 11:54 hashar: Upgrading pip on CI slaves from 7.0.1 to 8.1.2 https://gerrit.wikimedia.org/r/#/c/289639/
  • 10:15 hashar: puppet broken on deployment-tin : Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid parameter trusted_group on node deployment-tin.deployment-prep.eqiad.wmflabs

2016-05-18

  • 13:16 Amir1: deploying a05e830 to ores nodes (sca01 and ores-web)
  • 12:46 urandom: (re)cherry-picking c/284078 to deployment-prep
  • 11:36 hashar: Restarted qa-morebots
  • 11:36 hashar: Marked mediawiki/core/vendor repository as hidden in Gerrit. It got moved to mediawiki/vendor including the whole history. Settings page: https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/core/vendor

2016-05-13

  • 14:39 thcipriani: remove shadow l10nupdate user from deployment-tin and mira in beta
  • 10:20 hashar: Put integration-slave-trusty-1004 offline. Ssh/passwd is borked T135217
  • 09:59 hashar: Deleting non nodepool mediawiki PHPUnit jobs for T135001 (mediawiki-phpunit-hhvm mediawiki-phpunit-parsertests-hhvm mediawiki-phpunit-parsertests-php55 mediawiki-phpunit-php55)
  • 04:06 thcipriani|afk: changed ownership of mwdeploy public keys; doing this after the shadow mwdeploy user removal is important
  • 03:47 thcipriani|afk: ldap failure has created a shadow mwdeploy user on beta, deleted using vipw

2016-05-12

  • 22:53 bd808: Started dead mysql on integration-slave-precise-1011

2016-05-11

  • 21:05 hashar: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/288128 #T134946
  • 20:26 hashar: integration-slave-trusty-1016 is back up after reboot
  • 20:15 hashar: rebooting integration-slave-trusty-1016, unreachable somehow
  • 16:43 hashar: Reduced number of executors on Trusty instances from 3 to 2. Memory gets exhausted, causing the tmpfs to drop files and thus MW jobs to fail randomly.
  • 13:33 hashar: Added contint::packages::php to Nodepool images T119139
  • 12:59 hashar: Dropping texlive and its dependencies from gallium.
  • 12:52 hashar: deleted integration-dev
  • 12:51 hashar: creating integration-dev instance to hopefully have Shinken clean itself
  • 11:42 hashar: rebooting deployment-aqs01 via wikitech T134981
  • 10:46 hashar: beta/ci puppetmaster : deleting old tags in /var/lib/git/operations/puppet and repacking the repos
  • 08:49 hashar: Deleting instances deployment-memc02 and deployment-memc03 (Precise instances, migrated to Jessie) #T134974
  • 08:43 hashar: Beta: switching memcached to new Jessie servers by cherry picking https://gerrit.wikimedia.org/r/#/c/288156/ and running puppet on mw app servers #T134974
  • 08:20 hashar: Creating deployment-memc04 and deployment-memc05 to switch beta cluster memcached to Jessie. m1.medium with security policy "cache" T13497
  • 01:44 matt_flaschen: Created Flow-specific External Store tables (blobs_flow1) on all wiki databases on Beta Cluster: T128417

2016-05-10

  • 19:17 hashar: beta / CI purging old Linux kernels: salt -v '*' cmd.run 'dpkg -l|grep ^rc|awk "{ print \$2 }"|grep linux-image|xargs dpkg --purge'
  • 17:34 cscott: updated OCG to version b0c57a1c6890e9fa1f2c3743fc14cb6a7f244fc3
  • 16:44 bd808: Cleaned up 8.5G of pbuilder tmp output on integration-slave-jessie-1001 with `sudo find /mnt/pbuilder/build -maxdepth 1 -type d -mtime +1 -exec rm -r {} \+`
  • 16:35 bd808: https://integration.wikimedia.org/ci/job/debian-glue failure on integration-slave-jessie-1001 due to /mnt being 100% full
  • 14:20 hashar: deployment-puppetmaster mass cleaned packages/service/users etc T134881
  • 13:54 moritzm: restarted zuul-merger on scandium for openssl update
  • 13:52 moritzm: restarting zuul on gallium for openssl update
  • 13:51 moritzm: restarted apache and zuul-merger on gallium for openssl update
  • 13:48 hashar: deployment-puppetmaster : dropping role::ci::jenkins_access role::ci::slave::labs and role::ci::slave::labs::common T134881
  • 13:46 hashar: Deleting Jenkins slave deployment-puppetmaster T134881
  • 13:45 hashar: Change https://integration.wikimedia.org/ci/job/beta-build-deb/ job to use label selector "DebianGlue && DebianJessie" instead of "BetaDebianRepo" T134881
  • 13:33 hashar: Migrating all debian glue jobs to Jessie permanent slaves T95545
  • 13:30 hashar: Adding integration-slave-jessie-1002 in Jenkins. it is all puppet compliant
  • 12:59 thcipriani|afk: triggering puppet run on scap targets in beta for https://gerrit.wikimedia.org/r/#/c/287918/ cherry pick
  • 09:07 hashar: fixed puppet.conf on deployment-cache-text04

2016-05-09

  • 20:58 hashar: Unbroke puppet on integration-raita.integration.eqiad.wmflabs . Puppet was blocked because role::ci::raita was no more. Fixed by rebasing https://gerrit.wikimedia.org/r/#/c/208024 T115330
  • 20:13 hashar: beta: salt -v '*' cmd.run 'dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia' # T134808
  • 20:06 hashar: CI, removing ganglia configuration entirely via: salt -v '*' cmd.run 'rm -fRv /etc/ganglia' # T134808
  • 20:04 hashar: CI, removing ganglia configuration entirely via: salt -v '*' cmd.run 'dpkg --purge ganglia-monitor' # T134808
  • 16:32 jzerebecki: reloading zuul for 3e2ab56..d663fd0
  • 15:39 andrewbogott: migrating deployment-fluorine to labvirt1009
  • 15:39 hashar: Adding label contintLabsSlave to integration-slave-jessie-1001 and integration-slave-jessie-1002
  • 15:26 hashar: Creating integration-slave-jessie-1001 T95545

2016-05-06

  • 19:45 urandom: Restart cassandra-metrics-collector on deployment-restbase0[1-2]
  • 19:41 urandom: Rebasing 02ae1757 on deployment-puppetmaster : T126629

2016-05-05

  • 22:09 MaxSem: Promoted Yurik and Jgirault to sysops on beta enwiki. Through shell because logging in is broken for me.

2016-05-04

  • 21:28 cscott: deployed puppet FQDN domain patch for OCG: https://gerrit.wikimedia.org/r/286068 and restarted ocg on deployment-pdf0[12]
  • 15:03 hashar: beta-scap: deployment-tin.deployment-prep.eqiad.wmflabs Name or service not known
  • 15:03 hashar: beta-scap: deployment-tin.deployment-prep.eqiad.wmflabs
  • 12:24 hashar: deleting Jenkins job mediawiki-core-phpcs , replaced by Nodepool version mediawiki-core-phpcs-trusty T133976
  • 12:11 hashar: beta: restarted nginx on varnish caches ( systemctl restart nginx.service ) since they were not listening on port 443 #T134362
  • 11:07 hashar: restarted CI puppetmaster (out of memory / memory leak)
  • 10:57 hashar: CI: mass upgrading deb packages
  • 10:53 hashar: beta: clearing out leftover apt conf that points to unreachable web proxy : salt -v '*' cmd.run "find /etc/apt -name '*-proxy' -delete"
  • 10:48 hashar: Manually fixing nginx upgrade on deployment-cache-text04 and deployment-cache-upload04 see T134362 for details
  • 09:27 hashar: deployment-cache-text04 systemctl stop varnish-frontend.service . To clear out all the stuck CLOSE_WAIT connections T134346
  • 08:33 hashar: fixed puppet on deployment-cache-text04 (race condition generating puppet.conf )

2016-05-03

  • 23:21 bd808: Changed "Maximum Number of Retries" for ssh agent launch in jenkins for deployment-tin from "0" to "10"
  • 23:01 twentyafterfour: rebooting deployment-tin
  • 23:00 bd808: Jenkins agent on deployment-tin not spawning; investigating
  • 20:02 hashar: Restarting Jenkins
  • 16:49 hashar: Notice: /Stage[main]/Contint::Packages::Python/Package[pypy]/ensure: ensure changed 'purged' to 'present' | T134235
  • 16:46 hashar: Refreshing Nodepool Jessie image to have it include pypy | T134235 poke @jayvdb
  • 14:49 mobrovac: deployment-tin rebooting it
  • 14:25 hashar: beta salt -v '*' pkg.upgrade
  • 14:19 hashar: beta: added unattended upgrade to Hiera::deployment-prep
  • 13:30 hashar: Restarted nslcd on deployment-tin , pam was refusing authentication for some reason
  • 13:29 hashar: beta: got rid of a leftover Wikidata/Wikibase patch that broke scap salt -v 'deployment-tin*' cmd.run 'sudo -u jenkins-deploy git -C /srv/mediawiki-staging/php-master/extensions/Wikidata/ checkout -- extensions/Wikibase/lib/maintenance/populateSitesTable.php'
  • 13:23 hashar: deployment-tin force upgraded HHVM from 3.6 to 3.12
  • 09:42 hashar: adding puppet class contint::slave_scripts to deployment-sca01 and deployment-sca02 . Ships multigit.sh T134239
  • 09:31 hashar: Deleting CI slave deployment-cxserver03 , added deployment-sca01 and deployment-sca02 in Jenkins. T134239
  • 09:28 hashar: deployment-sca01 removing puppet lock /var/lib/puppet/state/agent_catalog_run.lock and running puppet again
  • 09:26 hashar: Applying puppet class role::ci::slave::labs::common on deployment-sca01 and deployment-sca02 (cxserver and parsoid being migrated T134239 )
  • 03:33 kart_: Deleted deployment-cxserver03, replaced by deployment-sca0x

2016-05-02

  • 21:27 cscott: updated OCG to version b775e612520f9cd4acaea42226bcf34df07439f7
  • 21:26 hashar: Nodepool is acting just fine: Demand from gearman: ci-trusty-wikimedia: 457 | <AllocationRequest for 455.0 of ci-trusty-wikimedia>
  • 21:25 hashar: restarted qa-morebots "2016-05-02 21:22:23,599 ERROR: Died in main event loop"
  • 21:23 hashar: gallium: enqueued 488 jobs directly in Gearman. That is to test https://gerrit.wikimedia.org/r/#/c/286462/ ( mediawiki/extensions to hhvm/zend5.5 on Nodepool). Progress /home/hashar/gerrit-286462.log
  • 20:14 hashar: MediaWiki phpunit jobs to run on Nodepool instances \O/
  • 16:41 urandom: Forcing puppet run and restarting Cassandra on deployment-restbase0[1-2] : T126629
  • 16:40 urandom: Cherry-picking https://gerrit.wikimedia.org/r/operations/puppet refs/changes/78/284078/12 to deployment-puppetmaster : T126629
  • 16:24 urandom: Restart Cassandra on deployment-restbase0[1-2] : T126629
  • 16:21 urandom: forcing puppet run on deployment-restbase0[1-2] : T126629
  • 16:21 urandom: cherry-picking latest refs/changes/78/284078/11 onto deployment-puppetmaster : T126629
  • 09:44 hashar: On zuul-merger instances (gallium / scandium), cleared out pywikibot/core working copy ( rm -fR /srv/ssd/zuul/git/pywikibot/core/ ) T134062

2016-04-30

  • 18:31 Amir1: deploying d4f63a3 from github.com/wiki-ai/ores-wikimedia-config into targets in beta cluster via scap3

2016-04-29

  • 16:37 jzerebecki: restarting zuul for 4e9d180..ebb191f
  • 15:45 hashar: integration: deleting integration-trusty-1026 and cache-rsync . Maybe that will clear them up from Shinken
  • 15:14 hashar: integration: created 'cache-rsync' and 'integration-trusty-1026' , attempting to have Shinken deprovision them

2016-04-28

  • 22:03 urandom: deployment-restbase01 upgrade to 2.2.6 complete : T126629
  • 21:56 urandom: Stopping Cassandra on deployment-restbase01, upgrading package to 2.2.6, and forcing puppet run : T126629
  • 21:55 urandom: Snapshotting Cassandra tables on deployment-restbase01 (name = 1461880519833) : T126629
  • 21:55 urandom: Snapshotting Cassandra tables on deployment-restbase01 : T126629
  • 21:52 urandom: Forcing puppet run on deployment-restbase02 : T126629
  • 21:51 urandom: Cherry picking operations/puppet refs/changes/78/284078/10 to puppetmaster : T126629
  • 20:46 urandom: Starting Cassandra on deployment-restbase02 (now v2.2.6) : T126629
  • 20:41 urandom: Re-enable puppet and force run on deployment-restbase02 : T126629
  • 20:38 urandom: Halting Cassandra on deployment-restbase02, masking systemd unit, and upgrading package(s) to 2.2.6 : T126629
  • 20:37 urandom: Snapshotting Cassandra tables on deployment-restbase02 (snapshot name = 1461875833996) : T126629
  • 20:37 urandom: Snapshotting Cassandra tables on deployment-restbase02 : T126629
  • 20:33 urandom: Cassandra on deployment-restbase01.deployment-prep started : T126629
  • 20:25 urandom: Restarting Cassandra on deployment-restbase01.deployment-prep : T126629
  • 20:14 urandom: Re-enable puppet on deployment-restbase01.deployment-prep, and force a run : T126629
  • 20:12 urandom: cherry-picking https://gerrit.wikimedia.org/r/#/c/284078/ to deployment-puppetmaster : T126629
  • 20:06 urandom: Disabling puppet on deployment-restbase0[1-2].deployment-prep : T126629
  • 14:43 hashar: Rebuild Nodepool Jessie image. Comes with hhvm
  • 12:52 hashar: Puppet is happy on deployment-changeprop
  • 12:47 hashar: apt-get upgrade deployment-changeprop (outdated exim package)
  • 12:42 hashar: Rebuild Nodepool Trusty instance to include the PHP wrapper script T126211

2016-04-27

  • 23:57 thcipriani: nodepool instances running again after an openstack rabbitmq restart by andrewbogott
  • 22:51 duploktm: also ran openstack server delete ci-jessie-wikimedia-85342
  • 22:42 legoktm: nodepool delete 85342
  • 22:41 matt_flaschen: Deployed https://gerrit.wikimedia.org/r/#/c/285765/ to enable External Store everywhere on Beta Cluster
  • 22:38 legoktm: stop/started nodepool
  • 22:36 thcipriani: I don't have permission to restart nodepool
  • 22:35 thcipriani: restarting nodepool
  • 22:18 matt_flaschen: Deployed https://gerrit.wikimedia.org/r/#/c/282440/ to switch Beta Cluster to use External Store for new testwiki writes
  • 21:00 hashar: thcipriani downgraded git plugins successfully (we wanted to rule out their upgrade for some weird issue)
  • 20:13 cscott: updated OCG to version e39e06570083877d5498da577758cf8d162c1af4
  • 14:10 hashar: restarting Jenkins
  • 14:09 hashar: Jenkins upgrading credential plugin 1.24 > 1.27 And Credentials binding plugin 1.6 > 1.7
  • 14:07 hashar: Jenkins upgrading git plugin 2.4.1 > 2.4.4
  • 14:01 hashar: Jenkins upgrading git client plugin 1.19.1 > 1.19.6
  • 13:13 jzerebecki: reloading zuul for 81a1f1a..0993349
  • 11:43 hashar: fixed puppet on deployment-cache-text04 T132689
  • 10:38 hashar: Rebuild Image ci-trusty-wikimedia-1461753210 in wmflabs-eqiad is ready
  • 09:43 hashar: tmh01.deployment-prep.eqiad.wmflabs denies mwdeploy user breaking https://integration.wikimedia.org/ci/job/beta-scap-eqiad/

2016-04-26

  • 20:45 hashar: Regenerating Nodepool Jessie snapshot to include composer and HHVM | T128092
  • 20:23 jzerebecki: reloading zuul for eb480d8..81a1f1a
  • 19:25 jzerebecki: reload zuul for 4675213..eb480d8
  • 19:25 jzerebecki: 4675213..eb480d8
  • 14:18 hashar: Applied security patches to 1.27.0-wmf.22 | T131556
  • 12:39 hashar: starting cut of 1.27.0-wmf.22 branch ( poke ostriches )
  • 10:29 hashar: restored integration/phpunit on CI slaves due to https://integration.wikimedia.org/ci/job/operations-mw-config-phpunit/ failing
  • 09:11 hashar: CI is back up!
  • 08:20 hashar: shutoff instance castor, does not seem to be able to start again :( | T133652
  • 08:12 hashar: hard rebooting castor instance | T133652
  • 08:10 hashar: soft rebooting castor instance | T133652
  • 08:06 hashar: CI jobs deadlocked due to castor being unavailable | https://phabricator.wikimedia.org/T133652
  • 00:46 thcipriani: temporary keyholder fix in place in beta
  • 00:18 thcipriani: beta-scap-eqiad failure due to bad keyholder-auth.d fingerprints

2016-04-25

  • 20:58 cscott: updated OCG to version 58a720508deb368abfb7652e6a8c7225f95402d2
  • 19:46 hashar: Nodepool now has a couple trusty instances intended to experiment with Zend 5.5 / HHVM migration . https://phabricator.wikimedia.org/T133203#2236625
  • 13:34 hashar: Nodepool is attempting to create a Trusty snapshot with name ci-trusty-wikimedia-1461591203 | T133203
  • 13:15 hashar: openstack image create --file /home/hashar/image-trusty-20160425T124552Z.qcow2 ci-trusty-wikimedia --disk-format qcow2 --property show=true # T133203
  • 10:38 hashar: Refreshing Nodepool Jessie snapshot based on new image
  • 10:35 hashar: Refreshed Nodepool Jessie image ( image-jessie-20160425T100035Z )
  • 09:24 hashar: beta / scap failure filed as T133521
  • 09:20 hashar: Keyholder / mwdeploy ssh keys have been messed up on beta cluster somehow :-(
  • 08:47 hashar: mwdeploy@deployment-tin has lost ssh host keys file :(

2016-04-24

  • 17:14 jzerebecki: reloading e06f1fe..672fc84

2016-04-22

2016-04-21

  • 19:07 thcipriani: scap version testing should be done, puppet should no longer be disabled on hosts
  • 18:02 thcipriani: disabling puppet on scap targets to test scap_3.1.0-1+0~20160421173204.70~1.gbp6706e0_all.deb

2016-04-20

  • 22:28 thcipriani: rolling back scap version in beta, legit failure :(
  • 21:52 thcipriani: testing new scap version in beta on deployment-tin
  • 17:54 thcipriani: Reloading Zuul to deploy gerrit:284494
  • 13:58 hashar: Stopping HHVM on CI slaves by cherry picking a couple puppet patches | T126594
  • 13:33 hashar: salt -v '*trusty*' cmd.run 'rm /usr/lib/x86_64-linux-gnu/hhvm/extensions/current' # Cleanup on CI slaves for T126658
  • 13:27 hashar: Restarted integration puppet master service (out of memory / mem leak)

2016-04-17

2016-04-16

  • 14:21 Krenair: restarted qa-morebots per request
  • 14:18 Krenair: <jzerebecki> !log reloading zuul for 3f64dbd..c6411a1

2016-04-13

2016-04-12

  • 19:47 bd808: Cleaned up large hhbc cache file on deployment-mediawiki03 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start`
  • 19:47 bd808: Cleaned up large hhbc cache file on deployment-mediawiki02 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start`
  • 19:46 bd808: Cleaned up large hhbc cache file on deployment-mediawiki01 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start`
  • 19:10 Amir1: manually rebooted deployment-ores-web
  • 19:08 Amir1: manually cherry-picked 282992/2 into puppetmaster
  • 17:05 Amir1: ran puppet agent in sca01 manually in /srv directory
  • 11:34 hashar: Jenkins upgrading "Script Security Plugin" from 1.17 to 1.18.1 https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2016-04-11

2016-04-11

  • 21:23 csteipp: deployed and reverted oath
  • 20:30 thcipriani: relaunched slave-agent on integration-slave-trusty-1025, back online
  • 20:19 thcipriani: integration-slave-trusty-1025 horizon console filled with INFO: task jbd2/vda1-8:170 blocked for more than 120 seconds. rebooting
  • 20:13 thcipriani: killing stuck jobs, marking integration-slave-trusty-1025 as offline temporarily
  • 14:42 thcipriani: deployment-mediawiki01 disk full :(

2016-04-08

  • 22:46 matt_flaschen: Created blobs1 table for all wiki DBs on Beta Cluster
  • 14:34 hashar: Image ci-jessie-wikimedia-1460125717 in wmflabs-eqiad is ready adds package 'unzip' | T132144
  • 12:49 hashar: Image ci-jessie-wikimedia-1460119481 in wmflabs-eqiad is ready , adds package 'zip' | T132144
  • 09:30 hashar: Removed label hasAndroidSdk from gallium . That prevents that slave from sometimes running the job apps-android-commons-build
  • 08:42 hashar: Rebased puppet master and fixed conflict with https://gerrit.wikimedia.org/r/#/c/249490/

2016-04-07

  • 20:16 hashar: deployment-mediawiki02.deployment-prep.eqiad.wmflabs , cleared up random left over stuff / big logs etc
  • 20:08 hashar: deployment-mediawiki02.deployment-prep.eqiad.wmflabs / is full

2016-04-05

  • 23:56 marxarelli: Removed cherry-pick and rebased /var/lib/git/operations/puppet on integration-puppetmaster after merge of https://gerrit.wikimedia.org/r/#/c/281706/
  • 21:58 marxarelli: Restarting puppetmaster on integration-puppetmaster
  • 21:53 marxarelli: Cherry picked https://gerrit.wikimedia.org/r/#/c/281706/ on integration-puppetmaster and applying on integration-slave-trusty-1014
  • 10:32 hashar: gallium removing texlive
  • 10:29 hashar: gallium removing libav / ffmpeg. No longer needed since jobs are no longer running on that server

2016-04-04

  • 17:30 greg-g: Phabricator going down in about 10 minutes to hopefully address the overheating issue: T131742
  • 10:06 hashar: integration: salt -v '*-slave*' cmd.run 'rm /usr/local/bin/grunt; rm -fR /usr/local/lib/node_modules/grunt-cli' | T124474
  • 10:04 hashar: integration: salt -v '*-slave*' cmd.run 'npm -g uninstall grunt-cli' | T124474
  • 03:15 greg-g: Phabricator is down

2016-04-03

2016-04-02

  • 22:58 Amir1: added local hack to puppetmaster to make scap3 provider more verbose
  • 19:46 hashar: Upgrading Jenkins Gearman plugin to v2.0 , brings in diff registration for faster updates of Gearman server
  • 14:39 Amir1: manually added 281170/5 to beta puppetmaster
  • 14:22 Amir1: manually added 281161/1 to beta puppetmaster
  • 11:31 Reedy: deleted archived logs older than 30 days from deployment-fluorine

2016-04-01

  • 22:16 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/281046
  • 21:13 hashar: Image ci-jessie-wikimedia-1459544873 in wmflabs-eqiad is ready
  • 20:57 hashar: Refreshing Nodepool snapshot to hopefully get npm 2.x installed T124474
  • 20:37 hashar: Added Luke081515 as a member of deployment-prep (beta cluster) labs project
  • 20:31 hashar: Dropping grunt-cli from the permanent slaves. People can have it installed by listing it in their package.json devDependencies https://gerrit.wikimedia.org/r/#/c/280974/
  • 14:06 hashar: integration: removed sudo policy permitting sudo as any member of the project for any member of the project, which included jenkins-deploy user
  • 14:05 hashar: integration: removed sudo policy permitting sudo as root for any member of the project, which included jenkins-deploy user
  • 11:23 bd808: Freed 4.5G on deployment-fluorine:/srv/mw-log by deleting wfDebug.log
  • 04:00 Amir1: manually rebooted deployment-sca01
  • 00:16 csteipp: created oathauth_users table on centralauth db in beta

2016-03-31

  • 21:19 legoktm: deploying https://gerrit.wikimedia.org/r/280756
  • 13:52 hashar: rebasing integration puppetmaster (it had some merge commit )
  • 01:40 Krinkle: Purged npm cache in integration-slave-trusty-1015:/mnt/home/jenkins-deploy/.npm; it was corrupted around March 23 19:00 for unknown reasons (T130895)

2016-03-30

  • 19:32 twentyafterfour: deleted some nutcracker and hhvm log files on deployment-mediawiki01 to free space
  • 15:37 hashar: Gerrit has trouble sending emails T131189
  • 13:48 Reedy: deployment-prep Make that deployment-tmh01
  • 13:48 Reedy: deployment-prep upgrade hhvm on deployment-mediawiki01 and reboot
  • 13:35 Reedy: deployment-prep upgrade hhvm on deployment-mediawiki03 and reboot
  • 12:16 gehel: deployment-prep restarting varnish on deployment-cache-text04
  • 11:04 Amir1: cherry-picked 280413/1 in beta puppetmaster, manually running puppet agent in deployment-ores-web
  • 10:22 Amir1: cherry-picking 280403 to beta puppetmaster and manually running puppet agent in deployment-ores-web

2016-03-29

  • 23:22 marxarelli: running jenkins-jobs update config/ 'mwext-donationinterfacecore125-testextension-zend53' to deploy https://gerrit.wikimedia.org/r/#/c/280261/
  • 19:52 Amir1: manually updated puppetmaster, deleted SSL cert key in deployment-ores-web in VM, running puppet agent manually
  • 02:20 jzerebecki: reloading zuul for 46923c8..c0937ee

2016-03-26

  • 22:38 jzerebecki: reloading zuul for 2d7e050..46923c8

2016-03-25

  • 23:55 marxarelli: deleting instances integration-slave-trusty-1002 and integration-slave-trusty-1005
  • 23:54 marxarelli: deleting jenkins nodes integration-slave-trusty-1002 and integration-slave-trusty-1005
  • 23:41 marxarelli: completed rolling manual deploy of https://gerrit.wikimedia.org/r/#/c/279640/ to trusty slaves
  • 23:27 marxarelli: starting rolling offline/remount/online of trusty slaves to increase tmpfs size
  • 23:22 marxarelli: pooled new trusty slaves integration-slave-trusty-1024 and integration-slave-trusty-1025
  • 23:13 jzerebecki: reloading zuul for 0aec21d..2d7e050
  • 22:14 marxarelli: creating new jenkins node for integration-slave-trusty-1024
  • 22:11 marxarelli: rebooting integration-slave-trusty-{1024,1025} before pooling as replacements for trusty-1002 and trusty-1005
  • 21:06 marxarelli: repooling integration-slave-trusty-{1005,1002} to help with load while replacement instances are provisioning
  • 16:59 marxarelli: depooling integration-slave-trusty-1002 until DNS resolution can be resolved. still investigating disk space issue

2016-03-24

  • 16:39 thcipriani: restarted rsync service on deployment-tin
  • 13:45 thcipriani|afk: rearmed keyholder on deployment-tin
  • 04:41 Krinkle: beta-update-databases-eqiad and beta-scap-eqiad stuck for over 8 hours (IRC notifier plugin deadlock)
  • 03:28 Krinkle: beta-mediawiki-config-update-eqiadqueued has been stuck for over 5 hours.

2016-03-23

  • 23:00 Krinkle: rm -rf integration-slave-trusty-1013:/mnt/home/jenkins-deploy/tmpfs/jenkins-2/karma-54925082/ (bad permissions, caused Karma issues)
  • 19:02 legoktm: restarted zuul

2016-03-22

2016-03-21

  • 21:55 hashar: zuul: almost all MediaWiki extensions migrated to run the npm job on Nodepool (with Node.js 4.3) T119143 . All tested. Will monitor the build results that ran overnight tomorrow
  • 20:28 hashar: Mass running npm-node-4.3 jobs against MediaWiki extensions to make sure they all pass ( https://gerrit.wikimedia.org/r/#/c/278004/ | T119143 )
  • 17:40 elukey: executed git rebase --interactive on deployment-puppetmaster.deployment-prep.eqiad.wmflabs to remove https://gerrit.wikimedia.org/r/#/c/278713/
  • 15:46 elukey: manually hacked the cdh puppet submodule on deployment-puppetmaster.deployment-prep.eqiad.wmflabs - please let me know if it interferes with anybody's tests
  • 14:24 elukey: executed git submodule update --init on deployment-puppetmaster.deployment-prep.eqiad.wmflabs
  • 11:25 elukey: beta: cherry picked https://gerrit.wikimedia.org/r/#/c/278713/ to test an update to the cdh module (analytics)
  • 11:13 hashar: beta: rebased puppet master which had a conflict on https://gerrit.wikimedia.org/r/#/c/274711/ which got merged meanwhile (saves Elukey )
  • 11:02 hashar: beta: added Elukey (wikimedia ops) to the project as member and admin

2016-03-19

  • 13:04 hashar: Jenkins: added ldap-labs-codfw.wikimedia.org as a fallback LDAP server T130446

2016-03-18

  • 17:16 jzerebecki: reloading zuul for e33494f..89a9659

2016-03-17

  • 21:10 thcipriani: updating scap on deployment-tin to test D133
  • 18:31 cscott: updated OCG to version c1a8232594fe846bd2374efd8f7c20d7e97ac449
  • 09:34 hashar: deployment-jobrunner01 deleted /var/log/apache/*.gz T130179
  • 09:04 hashar: Upgrading hhvm and related extensions on jobrunner01 T130179

2016-03-16

2016-03-15

  • 15:17 jzerebecki: added wikidata.beta.wmflabs.org in https://wikitech.wikimedia.org/wiki/Special:NovaAddress to deployment-cache-text04.deployment-prep.eqiad.wmflabs
  • 14:19 hashar: Image ci-jessie-wikimedia-1458051246 in wmflabs-eqiad is ready T124447
  • 14:14 hashar: Refreshing Nodepool snapshot images so they get a fresh copy of slave-scripts T124447
  • 14:08 hashar: Deploying slave script change https://gerrit.wikimedia.org/r/#/c/277508/ "npm-install-dev.py: Use config.dev.yaml instead of config.yaml" for T124447

2016-03-14

  • 22:18 greg-g: new jobs weren't processing in Zuul, lego fixed it and blamed Reedy
  • 20:13 hashar: Updating Jenkins jobs mwext-Wikibase-* so they no longer rely on --with-phpunit ( ping @hoo https://gerrit.wikimedia.org/r/#/c/277330/ )
  • 17:03 Krinkle: Doing full Zuul restart due to deadlock (T128569)
  • 10:18 moritzm: re-enabled systemd unit for logstash on deployment-logstash2

2016-03-11

  • 22:42 legoktm: deploying https://gerrit.wikimedia.org/r/276901
  • 19:41 legoktm: legoktm@integration-slave-trusty-1001:/mnt/jenkins-workspace/workspace$ sudo rm -rf mwext-Echo-testextension-* # because it was broken

2016-03-10

  • 20:22 hashar: Nodepool Image ci-jessie-wikimedia-1457641052 in wmflabs-eqiad is ready
  • 20:19 hashar: Refreshing Nodepool to include the 'varnish' package T128188
  • 20:05 hashar: apt-get upgrade integration-slave-jessie-1001 (brings in ffmpeg update and nodejs among other things)
  • 12:22 hashar: Nodepool Image ci-jessie-wikimedia-1457612269 in wmflabs-eqiad is ready
  • 12:18 hashar: Nodepool: rebuilding image to get mathoid/graphoid packages included (hopefully) T119693 T128280

2016-03-09

  • 17:56 bd808: Cleaned up git clone state in deployment-tin.deployment-prep:/srv/mediawiki-staging/php-master and queued beta-code-update-eqiad to try again (T129371)
  • 17:48 bd808: Git clone at deployment-tin.deployment-prep:/srv/mediawiki-staging/php-master in completely horrible state. Investigating
  • 17:22 bd808: Fixed https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/4452/
  • 17:19 bd808: Manually cleaning up broken rebase in deployment-tin.deployment-prep:/srv/mediawiki-staging
  • 16:27 bd808: Removed cherry-pick of https://gerrit.wikimedia.org/r/#/c/274696 ; manually cleaned up systemd unit and restarted logstash on deployment-logstash2
  • 14:59 hashar: Image ci-jessie-wikimedia-1457535250 in wmflabs-eqiad is ready T129345
  • 14:57 hashar: Rebuilding snapshot image to get Xvfb enabled at boot time T129345
  • 13:04 moritzm: cherrypicked patch to deployment-prep which provides a systemd unit for logstash
  • 10:52 hashar: Image ci-jessie-wikimedia-1457520493 in wmflabs-eqiad is ready
  • 10:29 hashar: Nodepool: created new image and refreshing snapshot in attempt to get Xvfb running T129320 T128090

2016-03-08

  • 23:42 legoktm: running CentralAuth's checkLocalUser.php --verbose=1 --delete=1 on deployment-tin for T115198
  • 21:33 hashar: Nodepool Image ci-jessie-wikimedia-1457472606 in wmflabs-eqiad is ready
  • 19:23 hashar: Zuul inject DISPLAY https://gerrit.wikimedia.org/r/#/c/273269/
  • 16:03 hashar: Image ci-jessie-wikimedia-1457452766 is ready T128090
  • 15:59 hashar: Nodepool: refreshing snapshot image to ship browsers+Xvfb for T128090
  • 14:27 hashar: Mass refreshed CI slave-scripts 1d2c60d..e27c292
  • 13:38 hashar: Rebased integration puppet master. Dropped a make-wmf-branch patch and the one for raita role
  • 11:26 hashar: Nodepool: created new snapshot to set puppet $::labsproject : ci-jessie-wikimedia-1457436175 hoping to fix hiera lookup T129092
  • 02:51 ori: deployment-prep Updating HHVM on deployment-mediawiki01
  • 02:27 ori: deployment-prep Updating HHVM on deployment-mediawiki02
  • 01:50 Krinkle: integration-saltmaster: salt -v '*slave-trusty*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm/src/skins/BlueSky' (T117710)
  • 01:50 Krinkle: integration-saltmaster: salt -v '*slave-trusty*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm-composer/src/skins/BlueSky'

2016-03-07

  • 21:03 hashar: Nodepool upgraded to 0.1.1-wmf.4 , it no longer waits 1 minute before deleting a used node | T118573
  • 20:05 hashar: Upgrading Nodepool from 0.1.1-wmf3 to 0.1.1-wmf.4 with andrewbogott | T118573

2016-03-06

2016-03-04

  • 19:31 hashar: Nodepool Image ci-jessie-wikimedia-1457119603 in wmflabs-eqiad is ready - T128846
  • 13:29 hashar: Nodepool Image ci-jessie-wikimedia-1457097785 in wmflabs-eqiad is ready
  • 08:42 hashar: CI deleting integration-slave-precise-1001 (2 executors). It is not in labs DNS, which causes a bunch of issues, and there is no need for the capacity anymore. T128802
  • 02:49 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/274889
  • 00:11 Krinkle: salt -v --show-timeout '*slave*' cmd.run "bash -c 'cd /srv/deployment/integration/slave-scripts; git pull'"

2016-03-03

  • 23:37 legoktm: salt -v --show-timeout '*slave*' cmd.run "bash -c 'cd /srv/deployment/integration/slave-scripts; git pull'"
  • 22:34 legoktm: mysql not running on integration-slave-precise-1002, manually starting (T109704)
  • 22:30 legoktm: mysql not running on integration-slave-precise-1011, manually starting (T109704)
  • 22:19 legoktm: mysql not running on integration-slave-precise-1012, manually starting (T109704)
  • 22:07 legoktm: deploying https://gerrit.wikimedia.org/r/274821
  • 21:58 Krinkle: Reloading Zuul to deploy (EventLogging and AdminLinks) https://gerrit.wikimedia.org/r/274821
  • 18:49 thcipriani: killing deployment-bastion since it is no longer used
  • 14:23 hashar: https://integration.wikimedia.org/ci/computer/integration-slave-trusty-1011/ is out of disk space

2016-03-02

2016-03-01

  • 23:10 Krinkle: Updated Jenkins configuration to also support php5 and hhvm for Console Sections detection of "PHPUnit"
  • 17:05 hashar: gerrit: set accounts inactive for Eloquence and Mgrover. Former employees of wmf and mail bounceback
  • 16:41 hashar: Restarted Jenkins
  • 16:32 hashar: A bunch of Jenkins jobs got stalled because I killed threads in Jenkins to unblock integration-slave-trusty-1003 :-(
  • 12:14 hashar: integration-slave-trusty-1003 is back online
  • 12:13 hashar: Might have killed the proper Jenkins thread to unlock integration-slave-trusty-1003
  • 12:03 hashar: Jenkins cannot pool back integration-slave-trusty-1003. The Jenkins master has a bunch of blocking threads piling up with hudson.plugins.sshslaves.SSHLauncher.afterDisconnect() locked somehow
  • 11:41 hashar: Rebooting integration-slave-trusty-1003 (does not reply to salt / ssh)
  • 10:34 hashar: Image ci-jessie-wikimedia-1456827861 in wmflabs-eqiad is ready
  • 10:24 hashar: Refreshing Nodepool snapshot instances
  • 10:22 hashar: Refreshing Nodepool base image to speed instances boot time (dropping open-iscsi package https://gerrit.wikimedia.org/r/#/c/273973/ )

2016-02-29

  • 16:23 hashar: salt -v '*slave*' cmd.run 'rm -fR /mnt/jenkins-workspace/workspace/mwext*jslint' T127362
  • 16:17 hashar: Deleting all mwext-.*-jslint jobs from Jenkins. Paladox has migrated all of them to jshint/jsonlint generic jobs T127362
  • 16:16 hashar: Deleting all mwext-.*-jslint jobs from Jenkins. Paladox has migrated all of them to jshint/jsonlint generic jobs
  • 09:46 hashar: Jenkins installing Yaml Axis Plugin 0.2.0

2016-02-28

  • 01:30 Krinkle: Rebooting integration-slave-precise-1012 – Might help T109704 (MySQL not running)

2016-02-26

  • 15:14 jzerebecki: salt -v --show-timeout '*slave*' cmd.run "bash -c 'cd /srv/deployment/integration/slave-scripts; git pull'" T128191
  • 14:44 hashar: (since it started, don't be that scared!)
  • 14:44 hashar: Nodepool has triggered 40 000 instances
  • 11:53 hashar: Restarted memcached on deployment-memc02 T128177
  • 11:53 hashar: memcached process on deployment-memc02 seems to have a nice leak of socket usages (from lost) and plainly refuses connections (bunch of CLOSE_WAIT) T128177
  • 11:40 hashar: deployment-memc04 find /etc/apt -name '*proxy' -delete (prevented apt-get update)
  • 11:26 hashar: beta: salt -v '*' cmd.run 'apt-get -y install ruby-msgpack' . I am tired of seeing puppet debug messages: "Debug: Failed to load library 'msgpack' for feature 'msgpack'"
  • 11:24 hashar: puppet keeps restarting nutcracker apparently T128177
  • 11:20 hashar: Memcached error for key "enwiki:flow_workflow%3Av2%3Apk:63dc3cf6a7184c32477496d63c173f9c:4.8" on server "127.0.0.1:11212": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY

2016-02-25

  • 22:38 hashar: beta: maybe deployment-jobrunner01 is processing jobs a bit faster now. Seems like hhvm went wild
  • 22:23 hashar: beta: jobrunner01 had apache/hhvm killed somehow .... Blame me
  • 21:56 hashar: beta: stopped jobchron / jobrunner on deployment-jobrunner01 and restarting them by running puppet
  • 21:49 hashar: beta did a git-deploy of jobrunner/jobrunner hoping to fix puppet run on deployment-jobrunner01 and apparently it did! T126846
  • 11:21 hashar: deleting workspace /mnt/jenkins-workspace/workspace/browsertests-Wikidata-WikidataTests-linux-firefox-sauce on slave-trusty-1015
  • 10:08 hashar: Jenkins upgraded T128006
  • 01:44 legoktm: deploying https://gerrit.wikimedia.org/r/273170
  • 01:39 legoktm: deploying https://gerrit.wikimedia.org/r/272955 (undeployed) and https://gerrit.wikimedia.org/r/273136
  • 01:37 legoktm: deploying https://gerrit.wikimedia.org/r/273136
  • 00:31 thcipriani: running puppet on beta to update scap to latest packaged version: sudo salt -b '10%' -G 'deployment_target:scap/scap' cmd.run 'puppet agent -t'
  • 00:20 thcipriani: deployment-tin not accepting jobs for some time, ran through https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update, is back now

2016-02-24

  • 19:55 legoktm: legoktm@deployment-tin:~$ mwscript extensions/ORES/maintenance/PopulateDatabase.php --wiki=enwiki
  • 18:30 bd808: "configuration file '/etc/nutcracker/nutcracker.yml' syntax is invalid"
  • 18:27 bd808: nutcracker dead on mediawiki01; investigating
  • 17:20 hashar: Deleted Nodepool instances so new ones get to use the new snapshot ci-jessie-wikimedia-1456333979
  • 17:12 hashar: Refreshing nodepool snapshot. It has been stalled since Feb 15th T127755
  • 17:01 bd808: https://wmflabs.org/sal/releng missing SAL data since 2016-02-20T20:19 due to bot crash; needs to be backfilled from wikitech data (T127981)
  • 16:43 hashar: SAL on Elasticsearch is stalled https://phabricator.wikimedia.org/T127981
  • 15:07 hasharAW: beta app servers have lost access to memcached due to bad nutcracker conf | T127966
  • 14:41 hashar: beta: we have lost a memcached server at 11:51am UTC

2016-02-23

  • 22:45 thcipriani: deployment-puppetmaster is in a weird rebase state
  • 22:25 legoktm: running sync-common manually on deployment-mediawiki02
  • 09:59 hashar: Deleted a bunch of mwext-.*-jslint jobs that are no longer in use (migrated to either 'npm' or 'jshint' / 'jsonlint' )

2016-02-22

  • 22:06 bd808: Restarted puppetmaster service on deployment-puppetmaster to "fix" error "invalid byte sequence in US-ASCII"
  • 17:46 jzerebecki: ssh integration-slave-trusty-1017.eqiad.wmflabs 'sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm/src/.git/config.lock'
  • 16:47 gehel: deployment-prep upgrading deployment-logstash2 to elasticsearch 1.7.5
  • 10:26 gehel: deployment-prep upgrading elastic-search to 1.7.5 on deployment-elastic0[5-8]

2016-02-20

  • 20:19 Krinkle: beta-code-update-eqiad job repeatedly stuck at "IRC notifier plugin"
  • 19:29 Krinkle: beta-code-update-eqiad broken because deployment-tin:/srv/mediawiki-staging/php-master/extensions/MobileFrontend/includes/MobileFrontend.hooks.php was modified on the server without commit
  • 19:22 Krinkle: Various beta-mediawiki-config-update-eqiad jobs have been stuck 'queued' for > 24 hours

2016-02-19

2016-02-18

2016-02-17

2016-02-16

  • 23:22 yuvipanda: new instances on deployment-prep no longer get NFS because of https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ADeployment-prep&type=revision&diff=311783&oldid=311781
  • 23:18 hashar: jenkins@gallium find /var/lib/jenkins/config-history/nodes -maxdepth 1 -type d -name 'ci-jessie*' -exec rm -vfR {} \;
  • 23:17 hashar: Jenkins accepting slave creations again. Root cause is /var/lib/jenkins/config-history/nodes/ has reached the 32k inode limit.
  • 23:14 hashar: Jenkins: Could not create rootDir /var/lib/jenkins/config-history/nodes/ci-jessie-wikimedia-34969/2016-02-16_22-40-23
  • 23:02 hashar: Nodepool can not authenticate with Jenkins anymore. Thus it can not add slaves it spawned.
  • 22:56 hashar: contint: Nodepool instances pool exhausted
  • 21:14 andrewbogott: deployment-logstash2 migration finished
  • 20:49 jzerebecki: reloading zuul for 3bf7584..67fec7b
  • 19:58 andrewbogott: migrating deployment-logstash2 to labvirt1010
  • 19:00 hashar: tin: checking out mw 1.27.0-wmf.14
  • 15:23 hashar: integration-make-wmfbranch : /mnt/make-wmf-branch mount now has gid=wikidev and group setuid (i.e. mode 2775)
  • 15:20 hashar: integration-make-wmfbranch : change tmpfs to /mnt/make-wmf-branch (from /var/make-wmf-branch )
  • 11:30 jzerebecki: T117710 integration-saltmaster:~# salt -v '*slave-trusty*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm-composer/src/skins/BlueSky'
  • 09:52 hashar: will cut the wmf branches this afternoon starting around 14:00 CET

2016-02-15

2016-02-14

2016-02-13

  • 06:42 bd808: restarted nutcracker on deployment-mediawiki01
  • 06:32 bd808: jobrunner on deployment-jobrunner01 enabled after reverting changes from T87928 that caused T126830
  • 05:51 bd808: disabled jobrunner process on jobrunner01; queue full of jobs broken by T126830
  • 05:31 bd808: trebuchet clone of /srv/jobrunner/jobrunner broken on jobrunner01; failing puppet runs
  • 05:25 bd808: jobrunner process on deployment-jobrunner01 badly broken; investigating
  • 05:20 bd808: Ran https://phabricator.wikimedia.org/P2273 on deployment-jobrunner01.deployment-prep.eqiad.wmflabs; freed ~500M; disk utilization still at 94%

2016-02-12

  • 23:54 hashar: beta cluster broken since 20:30 UTC https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/fatalmonitor ; haven't looked yet
  • 17:36 hashar: salt -v '*slave-trusty*' cmd.run 'apt-get -y install texlive-generic-extra' # T126422
  • 17:32 hashar: adding texlive-generic-extra on CI slaves by cherry picking https://gerrit.wikimedia.org/r/#/c/270322/ - T126422
  • 17:19 hashar: get rid of integration-dev it is broken somehow
  • 17:10 hashar: Nodepool back at spawning instances. contintcloud has been migrated in wmflabs
  • 16:51 thcipriani: running sudo salt '*' -b '10%' deploy.fixurl to fix deployment-prep trebuchet urls
  • 16:31 hashar: bd808 added support for saltbot to update tasks automagically!!!! T108720
  • 03:10 yurik: attempted to sync graphoid from gerrit 270166 from deployment-tin, but it wouldn't sync. Tried to git pull sca02, submodules wouldn't pull

2016-02-11

  • 22:53 thcipriani: shutting down deployment-bastion
  • 21:28 hashar: pooling back slaves 1001 to 1006
  • 21:18 hashar: re enabling hhvm service on slaves ( https://phabricator.wikimedia.org/T126594 ) Some symlink is missing and only provided by the upstart script grrrrrrr https://phabricator.wikimedia.org/T126658
  • 20:52 legoktm: deploying https://gerrit.wikimedia.org/r/270098
  • 20:35 hashar: depooling the six recent slaves: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so cannot open shared object file
  • 20:29 hashar: pooling integration-slave-trusty-1004 integration-slave-trusty-1005 integration-slave-trusty-1006
  • 20:14 hashar: pooling integration-slave-trusty-1001 integration-slave-trusty-1002 integration-slave-trusty-1003
  • 19:35 marxarelli: modifying deployment server node in jenkins to point to deployment-tin
  • 19:27 thcipriani: running sudo salt -b '10%' '*' cmd.run 'puppet agent -t' from deployment-salt
  • 19:27 twentyafterfour: Keeping notes on the ticket: https://phabricator.wikimedia.org/T126537
  • 19:24 thcipriani: moving deployment-bastion to deployment-tin
  • 17:59 hashar: recreated instances with proper names: integration-slave-trusty-{1001-1006}
  • 17:52 hashar: Created integration-slave-trusty-{1019-1026} as m1.large (note: 1023 is an exception, it is for Android). Applied role::ci::slave , let's wait for puppet to finish
  • 17:42 Krinkle: Currently testing https://gerrit.wikimedia.org/r/#/c/268802/ in Beta Labs
  • 17:27 hashar: Depooling all the ci.medium slaves and deleting them.
  • 17:27 hashar: I tried. The ci.medium instances are too small and MediaWiki tests really need 1.5GBytes of memory :-(
  • 16:00 hashar: rebuilding integration-dev https://phabricator.wikimedia.org/T126613
  • 15:27 Krinkle: Deploy Zuul config change https://gerrit.wikimedia.org/r/269976
  • 11:46 hashar: salt -v '*' cmd.run '/etc/init.d/apache2 restart' might help for Wikidata browser tests failing
  • 11:32 hashar: disabling hhvm service on CI slaves ( https://phabricator.wikimedia.org/T126594 , cherry picked both patches )
  • 10:50 hashar: reenabled puppet on CI. All transitioned to a 128MB tmpfs (was 512MB)
  • 10:16 hashar: pooling back integration-slave-trusty-1009 and integration-slave-trusty-1010 (tmpfs shrunken)
  • 10:06 hashar: disabling puppet on all CI slaves. Trying to lower tmpfs 512MB to 128MB ( https://gerrit.wikimedia.org/r/#/c/269880/ )
  • 02:45 legoktm: deploying https://gerrit.wikimedia.org/r/269853 https://gerrit.wikimedia.org/r/269893

2016-02-10

  • 23:54 hashar_: depooling Trusty slaves that only have 2GB of RAM, which is not enough. https://phabricator.wikimedia.org/T126545
  • 22:55 hashar_: gallium: find /var/lib/jenkins/config-history/config -type f -wholename '*/2015*' -delete ( https://phabricator.wikimedia.org/T126552 )
  • 22:34 Krinkle: Zuul is back up and processing Gerrit events, but jobs are still queued indefinitely. Jenkins is not accepting new jobs
  • 22:31 Krinkle: Full restart of Zuul. Seems Gearman/Zuul got stuck. All executors were idling. No new Gerrit events processed either.
  • 21:22 legoktm: cherry-picking https://gerrit.wikimedia.org/r/#/c/269370/ on integration-puppetmaster again
  • 21:17 hashar: CI dust has settled. Krinkle and I have pooled a lot more Trusty slaves to accommodate the overload caused by switching to php55 (jobs run on Trusty)
  • 21:08 hashar: pooling trusty slaves 1009, 1010, 1021, 1022 with 2 executors (they are ci.medium)
  • 20:38 hashar: cancelling mediawiki-core-jsduck-publish and mediawiki-core-doxygen-publish jobs manually. They will catch up on next merge
  • 20:34 Krinkle: Pooled integration-slave-trusty-1019 (new)
  • 20:28 Krinkle: Pooled integration-slave-trusty-1020 (new)
  • 20:24 Krinkle: created integration-slave-trusty-1019 and integration-slave-trusty-1020 (ci1.medium)
  • 20:18 hashar: created integration-slave-trusty-1009 and 1010 (trusty ci.medium)
  • 20:06 hashar: creating integration-slave-trusty-1021 and integration-slave-trusty-1022 (ci.medium)
  • 19:48 greg-g: that cleanup was done by apergos
  • 19:48 greg-g: did cleanup across all integration slaves, some were very close to out of room. results: https://phabricator.wikimedia.org/P2587
  • 19:43 hashar: Dropping slaves Precise m1.large integration-slave-precise-1014 and integration-slave-precise-1013 , most load shifted to Trusty (php53 -> php55 transition)
  • 18:20 Krinkle: Creating a Trusty slave to support increased demand following MediaWiki php53(precise)>php55(trusty) bump
  • 16:06 jzerebecki: reloading zuul for 41a92d5..5b971d1
  • 15:42 jzerebecki: reloading zuul for 639dd40..41a92d5
  • 14:12 jzerebecki: recover a bit of disk space: integration-saltmaster:~# salt --show-timeout '*slave*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/*WikibaseQuality*'
  • 13:46 jzerebecki: reloading zuul for 639dd40
  • 13:15 jzerebecki: reloading zuul for 3be81c1..e8e0615
  • 08:07 legoktm: deploying https://gerrit.wikimedia.org/r/269619
  • 08:03 legoktm: deploying https://gerrit.wikimedia.org/r/269613 and https://gerrit.wikimedia.org/r/269618
  • 06:41 legoktm: deploying https://gerrit.wikimedia.org/r/269607
  • 06:34 legoktm: deploying https://gerrit.wikimedia.org/r/269605
  • 02:59 legoktm: deleting 14GB broken workspace of mediawiki-core-php53lint from integration-slave-precise-1004
  • 02:37 legoktm: deleting /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm-composer on trusty-1017, it had a skin cloned into it
  • 02:26 legoktm: queuing mwext jobs server-side to identify failing ones
  • 02:21 legoktm: deploying https://gerrit.wikimedia.org/r/269582
  • 01:03 legoktm: deploying https://gerrit.wikimedia.org/r/269576

2016-02-09

  • 23:17 legoktm: deploying https://gerrit.wikimedia.org/r/269551
  • 23:02 legoktm: gracefully restarting zuul
  • 22:57 legoktm: deploying https://gerrit.wikimedia.org/r/269547
  • 22:29 legoktm: deploying https://gerrit.wikimedia.org/r/269540
  • 22:18 legoktm: re-enabling puppet on all CI slaves
  • 22:02 legoktm: reloading zuul to see if it'll pickup the new composer-php53 job
  • 21:53 legoktm: enabling puppet on just integration-slave-trusty-1012
  • 21:52 legoktm: cherry-picked https://gerrit.wikimedia.org/r/#/c/269370/ onto integration-puppetmaster
  • 21:50 legoktm: disabling puppet on all trusty/precise CI slaves
  • 21:40 legoktm: deploying https://gerrit.wikimedia.org/r/269533
  • 17:49 marxarelli: disabled/enabled gearman in jenkins, connection works this time
  • 17:49 marxarelli: performed stop/start of zuul on gallium to restore zuul and gearman
  • 17:45 marxarelli: "Failed: Unable to Connect" in jenkins when testing gearman connection
  • 17:40 marxarelli: killed old zuul process manually and restarted service
  • 17:39 marxarelli: restart of zuul fails as well. old process cannot be killed
  • 17:38 marxarelli: reloading zuul fails with "failed to kill 13660: Operation not permitted"
  • 16:06 bd808: Deleted corrupt integration-slave-precise-1003:/mnt/jenkins-workspace/workspace/mediawiki-core-php53lint/.git
  • 15:11 hashar: mira: /srv/mediawiki-staging/multiversion/checkoutMediaWiki 1.27.0-wmf.13 php-1.27.0-wmf.13
  • 14:51 hashar: ./make-wmf-branch -n 1.27.0-wmf.13 -o master
  • 14:50 hashar: pooling back integration-slave-precise-1001 to 1004. Manually fetched git repos in workspace for mediawiki core php53
  • 14:49 hashar: make-wmf-branch instance: created a local ssh key pair and set the config to use User: hashar
  • 14:13 hashar: pooling https://integration.wikimedia.org/ci/computer/integration-slave-precise-1012/ Mysql is back .. Blame puppet
  • 14:12 hashar: depooling https://integration.wikimedia.org/ci/computer/integration-slave-precise-1012/ Mysql is gone somehow
  • 14:04 hashar: Manually git fetching mediawiki-core in /mnt/jenkins-workspace/workspace/mediawiki-core-php53lint of slaves precise 1001 to 1004 (git on Precise is remarkably too slow)
  • 13:28 hashar: salt '*trusty*' cmd.run 'update-alternatives --set php /usr/bin/hhvm'
  • 13:28 hashar: salt '*precise*' cmd.run 'update-alternatives --set php /usr/bin/php5'
  • 13:18 hashar: salt -v --batch=3 '*slave*' cmd.run 'puppet agent -tv'
  • 13:15 hashar: removing https://gerrit.wikimedia.org/r/#/c/269370/ from CI puppet master
  • 13:14 hashar: slave recurse infinitely doing /bin/bash -eu /srv/deployment/integration/slave-scripts/bin/mw-install-mysql.sh then loop over /bin/bash /usr/bin/php maintenance/install.php --confpath /mnt/jenkins-workspace/workspace/mediawiki-core-qunit/src --dbtype=mysql --dbserver=127.0.0.1:3306 --dbuser=jenkins_u2 --dbpass=pw_jenkins_u2 --dbname=jenkins_u2_mw --pass testpass TestWiki WikiAdmin https://phabricator.wikimedia.org/T126327
  • 12:46 hashar: Mass testing php loop of death: salt -v '*slave*' cmd.run 'timeout 2s /srv/deployment/integration/slave-scripts/bin/php --version'
  • 12:40 hashar: mass rebooting CI slaves from wikitech
  • 12:39 hashar: salt -v '*' cmd.run "bash -c 'cd /srv/deployment/integration/slave-scripts; git pull'"
  • 12:33 hashar: all slaves dying due to PHP looping
  • 12:02 legoktm: re-enabling puppet on all trusty/precise slaves
  • 11:20 legoktm: cherry-picked https://gerrit.wikimedia.org/r/#/c/269370/ on integration-puppetmaster
  • 11:20 legoktm: enabling puppet just on integration-slave-trusty-1012
  • 11:13 legoktm: disabling puppet on all *(trusty|precise)* slaves
  • 10:26 hashar: pooling in integration-slave-trusty-1018
  • 03:19 legoktm: deploying https://gerrit.wikimedia.org/r/269359
  • 02:53 legoktm: deploying https://gerrit.wikimedia.org/r/238988
  • 00:39 hashar: gallium edited /usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/trigger/gerrit.py and modified: replication_timeout = 300 -> replication_timeout = 10
  • 00:37 hashar: live hacking Zuul code to have it stop sleeping() on force merge
  • 00:36 hashar: killing zuul

2016-02-08

2016-02-06

  • 18:34 jzerebecki: reloading zuul for bdb2ed4..46ccca9

2016-02-05

  • 13:30 hashar: beta: cleaning out /data/project/logs/archive , which was from the pre-logstash era. We have not logged this way since May 2015, apparently
  • 13:29 hashar: beta: deleting /data/project/swift-disk , created in August 2014 and unused since June 2015. It was a failed attempt at bringing swift to beta
  • 13:27 hashar: beta: reclaiming disk space from extensions.git. On bastion: find /srv/mediawiki-staging/php-master/extensions/.git/modules -maxdepth 1 -type d -print -execdir git gc \;
  • 13:03 hashar: integration-slave-trusty-1011 went out of disk space. Did some brute clean up and git gc.
  • 05:21 Tim: configured mediawiki-extensions-qunit to only run on integration-slave-trusty-1017, did a rebuild and then switched it back

2016-02-04

  • 22:08 jzerebecki: reloading zuul for bed7be1..f57b7e2
  • 21:51 hashar: salt-key -d integration-slave-jessie-1001.eqiad.wmflabs
  • 21:50 hashar: salt-key -d integration-slave-precise-1011.eqiad.wmflabs
  • 00:57 bd808: Got deployment-bastion processing Jenkins jobs again via instructions left by my past self at https://phabricator.wikimedia.org/T72597#747925
  • 00:43 bd808: Jenkins agent on deployment-bastion.eqiad doing the trick where it doesn't pick up jobs again

2016-02-03

  • 22:24 bd808: Manually ran sync-common on deployment-jobrunner01.eqiad.wmflabs to pickup wmf-config changes that were missing (InitializeSettings, Wikibase, mobile)
  • 17:43 marxarelli: Reloading Zuul to deploy previously undeployed Icd349069ec53980ece2ce2d8df5ee481ff44d5d0 and Ib18fe48fe771a3fe381ff4b8c7ee2afb9ebb59e4
  • 15:12 hashar: apt-get upgrade deployment-sentry2
  • 15:03 hashar: redeployed rcstream/rcstream on deployment-stream by using git-deploy on deployment-bastion
  • 14:55 hashar: upgrading deployment-stream
  • 14:42 hashar: pooled back integration-slave-trusty-1015 Seems ok
  • 14:35 hashar: manually triggered a bunch of browser tests jobs
  • 11:40 hashar: apt-get upgrade deployment-ms-be01 and deployment-ms-be02
  • 11:32 hashar: fixing puppet.conf on deployment-memc04
  • 11:09 hashar: restarting beta cluster puppetmaster just in case
  • 11:07 hashar: beta: apt-get upgrade on deployment-cache* hosts and checking puppet
  • 10:59 hashar: integration/beta: deleting /etc/apt/apt.conf.d/*proxy files. There is no need for them, in fact web proxy is not reachable from labs
  • 10:53 hashar: integration: switched puppet repo back to 'production' branch, rebased.
  • 10:49 hashar: various beta cluster have puppet errors ..
  • 10:46 hashar: integration-slave-trusty-1013 heading to out of disk space on /mnt ...
  • 10:42 hashar: integration-slave-trusty-1016 out of disk space on /mnt ...
  • 03:45 bd808: Puppet failing on deployment-fluorine with "Error: Could not set uid on user[datasets]: Execution of '/usr/sbin/usermod -u 10003 datasets' returned 4: usermod: UID '10003' already exists"
  • 03:44 bd808: Freed 28G by deleting deployment-fluorine:/srv/mw-log/archive/*2015*
  • 03:42 bd808: Ran deployment-bastion.deployment-prep:/home/bd808/cleanup-var-crap.sh and freed 565M

2016-02-02

  • 18:32 marxarelli: Reloading Zuul to deploy If1f3cb60f4ccb2c1bca112900dbada03a8588370
  • 17:42 marxarelli: cleaning mwext-donationinterfacecore125-testextension-php53 workspace on integration-slave-precise-1013
  • 17:06 ostriches: running sync-common on mw2051 and mw1119
  • 09:38 hashar: Jenkins is fully up and operational
  • 09:33 hashar: restarting Jenkins
  • 08:47 hashar: pooling back integration-slave-precise-1011 , puppet run got fixed ( https://phabricator.wikimedia.org/T125474 )
  • 03:48 legoktm: deploying https://gerrit.wikimedia.org/r/267828
  • 03:29 legoktm: deploying https://gerrit.wikimedia.org/r/266941
  • 00:42 legoktm: due to T125474
  • 00:42 legoktm: marked integration-slave-precise-1011 as offline
  • 00:39 legoktm: precise-1011 slave hasn't had a puppet run in 6 days

2016-02-01

  • 23:53 bd808: Logstash working again; I applied a change to the default mapping template for Elasticsearch that ensures that fields named "timestamp" are indexed as plain strings
  • 23:46 bd808: Elasticsearch index template for beta logstash cluster making crappy guesses about syslog events; dropped 2016-02-01 index; trying to fix default mappings
  • 23:09 bd808: HHVM logs causing rejections during document parse when inserting in Elasticsearch from logstash. They contain a "timestamp" field that looks like "Feb 1 22:56:39" which is making the mapper in Elasticsearch sad.
  • 23:04 bd808: Elasticsearch on deployment-logstash2 rejecting all documents with 400 status. Investigating
  • 22:50 bd808: Copying deployment-logstash2.deployment-prep:/var/log/logstash/logstash.log to /srv for debugging later
  • 22:48 bd808: deployment-logstash2.deployment-prep:/var/log/logstash/logstash.log is 11G of fail!
  • 22:46 bd808: root partition on deployment-logstash2 full
  • 22:43 bd808: No data in logstash since 2016-01-30T06:55:37.838Z; investigating
  • 15:33 hashar: Image ci-jessie-wikimedia-1454339883 in wmflabs-eqiad is ready
  • 15:01 hashar: Refreshing Nodepool image. Might have npm/grunt properly set up
  • 03:15 legoktm: deploying https://gerrit.wikimedia.org/r/267630

2016-01-31

  • 13:35 hashar: Jenkins IRC bot started failing at Jan 30 01:04:00 2016 for whatever reason.... Should be fine now
  • 13:33 hashar: cancelling/aborting jobs that are stuck while reporting to IRC (mostly browser tests and beta cluster jobs)
  • 13:32 hashar: Jenkins jobs are being blocked because they can no longer report back to IRC :-(((
  • 13:28 hashar: Jenkins jobs are being blocked because they can no longer report back to IRC :-(((

2016-01-30

  • 12:46 hashar: integration-slave-jessie-1001 : fixed puppet.conf server name and ran puppet

2016-01-29

  • 18:43 thcipriani: updated scap on beta
  • 16:44 thcipriani: deployed scap updates on beta
  • 11:58 _joe_: upgraded hhvm to 3.6 wm8 in deployment-prep

2016-01-28

  • 23:22 MaxSem: Updated portals on betalabs to master
  • 22:23 hashar: salt '*slave-precise*' cmd.run 'apt-get install php5-ldap' ( https://phabricator.wikimedia.org/T124613 ) will need to be puppetized
  • 18:17 thcipriani: cleaning npm cache on slave machines: salt -v '*slave*' cmd.run 'sudo -i -u jenkins-deploy -- npm cache clean'
  • 18:12 thcipriani: running npm cache clean on integration-slave-precise-1011 sudo -i -u jenkins-deploy -- npm cache clean
  • 15:25 hashar: apt-get upgrade deployment-sca01 and deployment-sca02
  • 15:09 hashar: fixing puppet.conf hostname on deployment-upload deployment-conftool deployment-tmh01 deployment-zookeeper01 and deployment-urldownloader
  • 15:06 hashar: fixing puppet.conf hostname on deployment-upload.deployment-prep.eqiad.wmflabs and running puppet
  • 15:00 hashar: Running puppet on deployment-memc02 and deployment-elastic07 . It is catching up with a lot of changes
  • 14:59 hashar: fixing puppet hostnames on deployment-elastic07
  • 14:59 hashar: fixing puppet hostnames on deployment-memc02
  • 14:55 hashar: Deleted salt keys deployment-pdf01.eqiad.wmflabs and deployment-memc04.eqiad.wmflabs (obsolete, entries with '.deployment-prep.' are already there)
  • 07:38 jzerebecki: reload zuul for 4951444..43a030b
  • 05:55 jzerebecki: doing https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update
  • 03:49 mobrovac: deployment-prep re-enabled puppet on deployment-restbase0x
  • 02:49 mobrovac: deployment-prep deployment-restbase01 disabled puppet to set up cassandra for
  • 02:27 mobrovac: deployment-prep recreating deployment-restbase01 for T125003
  • 02:23 mobrovac: deployment-prep deployment-restbase02 disabled puppet to recreate deployment-restbase01 for T125003
  • 01:42 mobrovac: deployment-prep recreating deployment-sca02 for T125003
  • 01:28 mobrovac: deployment-prep recreating deployment-sca01 for T125003
  • 00:36 mobrovac: deployment-prep re-imaging deployment-mathoid for T125003
  • 00:02 jzerebecki: integration-slave-trusty-1016:~$ sudo -i rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm/src/skins/Donate

2016-01-27

  • 23:49 jzerebecki: integration-slave-precise-1011:~$ sudo -i /etc/init.d/salt-minion restart
  • 23:46 jzerebecki: work around https://phabricator.wikimedia.org/T117710 : salt --show-timeout '*slave*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm/src/skins/BlueSky'
  • 21:19 cscott: updated OCG to version 64050af0456a43344b32e3e93561a79207565eaf (should be no-op after yesterday's deploy)
  • 10:29 hashar: triggered a bunch of browser tests, deployment-redis01 was dead/faulty
  • 10:08 hashar: mass restarting redis-server process on deployment-redis01 (for https://phabricator.wikimedia.org/T124677 )
  • 10:07 hashar: mass restarting redis-server process on deployment-redis01
  • 09:00 hashar: beta: commenting out "latency-monitor-threshold 100" parameter from any /etc/redis/redis.conf we have ( https://phabricator.wikimedia.org/T124677 ). Puppet will not reapply it unless distribution is Jessie

2016-01-26

  • 16:51 cscott: updated OCG to version 64050af0456a43344b32e3e93561a79207565eaf
  • 12:14 hashar: Added Jenkins IRC bot (wmf-insecte) to #wikimedia-perf for https://gerrit.wikimedia.org/r/#/c/265631/
  • 09:30 hashar: restarting Jenkins to upgrade the gearman plugin with https://review.openstack.org/#/c/271543/
  • 04:18 bd808: integration-slave-jessie-1001:/mnt full; cleaned up 15G of files in /mnt/pbuilder/build (27 hours after the last time I did that)

2016-01-25

  • 18:59 twentyafterfour: started redis-server on deployment-redis01 by commenting out latency-monitor-threshold from the redis.conf
  • 15:22 hashar: CI: fixing kernels not upgrading via: rm /boot/grub/menu.lst ; update-grub -y (i.e.: regenerate the Grub menu from scratch)
  • 14:21 hashar: integration-slave-trusty-1015.integration.eqiad.wmflabs is gone. I have failed the kernel upgrade / grub update
  • 01:35 bd808: integration-slave-jessie-1001:/mnt full; cleaned up 15G of files in /mnt/pbuilder/build

2016-01-24

2016-01-22

  • 23:58 legoktm: removed skins from mwext-qunit workspace on trusty-1013 slave
  • 23:34 legoktm: rm -rf /mnt/jenkins-workspace/workspace/mediawiki-phpunit-php53 on slave precise 1012
  • 22:45 legoktm: deploying https://gerrit.wikimedia.org/r/265864
  • 22:27 hashar: rebooted all CI slaves using OpenStackManager
  • 22:09 hashar: rebooting deployment-redis01 (kernel upgrade)
  • 21:22 hashar: Image ci-jessie-wikimedia-1453497269 in wmflabs-eqiad is ready (with node 4.2 for https://phabricator.wikimedia.org/T119143 )
  • 21:14 hashar: updating nodepool snapshot based on new image
  • 21:12 hashar: rebuilding nodepool reference image
  • 20:04 hashar: Image ci-jessie-wikimedia-1453492820 in wmflabs-eqiad is ready
  • 20:00 hashar: Refreshing nodepool image to hopefully get Nodejs 4.2.4 https://phabricator.wikimedia.org/T124447 https://gerrit.wikimedia.org/r/#/c/265802/
  • 16:32 hashar: Nuked corrupted git repo on integration-slave-precise-1012 /mnt/jenkins-workspace/workspace/mediawiki-extensions-php53
  • 12:23 hashar: beta: reinitialized keyholder on deployment-bastion. The proxy apparently had no identity
  • 09:32 hashar: beta cluster Jenkins jobs have been stalled for 9 hours and 25 minutes. Disabling/reenabling the Gearman plugin to remove the deadlock

2016-01-21

  • 21:41 hashar: restored role::mail::mx on deployment-mx
  • 21:36 hashar: dropping role::mail::mx from deployment-mx to let puppet run
  • 21:33 hashar: rebooting deployment-jobrunner01 / kernel upgrade / /tmp is only 1MBytes
  • 21:19 hashar: fixing up deployment-jobrunner01 /tmp and / disks are full
  • 19:57 thcipriani: ran REPAIR TABLE globalnames; on centralauth db
  • 19:48 legoktm: deploying https://gerrit.wikimedia.org/r/265552
  • 19:39 legoktm: deploying jjb changes for https://gerrit.wikimedia.org/r/264990
  • 19:25 legoktm: deploying https://gerrit.wikimedia.org/r/265546
  • 01:59 jzerebecki: jenkins-deploy@deployment-bastion:/srv/mediawiki-staging/php-master/extensions/SpellingDictionary$ rm -r modules/jquery.uls && git rm modules/jquery.uls
  • 01:00 jzerebecki: jenkins-deploy@deployment-bastion:/srv/mediawiki-staging/php-master/extensions$ git pull && git submodule update --init --recursive
  • 00:57 jzerebecki: jenkins-deploy@deployment-bastion:/srv/mediawiki-staging/php-master/extensions$ git reset HEAD SpellingDictionary

2016-01-20

  • 20:05 hashar: beta sudo find /data/project/upload7/math -type f -delete (probably some old leftovers)
  • 19:50 hashar: beta: on commons ran deleteArchivedFile.php : Nuked 7130 files
  • 19:49 hashar: beta : foreachwiki deleteArchivedRevisions.php -delete
  • 19:26 hasharAway: Nuked all files from http://commons.wikimedia.beta.wmflabs.org/wiki/Category:GWToolset_Batch_Upload
  • 19:19 hasharAway: beta: sudo find /data/project/upload7/*/*/temp -type f -delete
  • 19:14 hasharAway: beta: sudo rm /data/project/upload7/*/*/lockdir/*
  • 18:57 hasharAway: beta cluster code has been stalled for roughly 2h30
  • 18:55 hasharAway: disconnecting Gearman plugin to remove deadlock for beta cluster jobs
  • 17:06 hashar: clearing files from beta-cluster to prepare for Swift migration. python pwb.py delete.py -family:betacommons -lang:en -cat:'GWToolset Batch Upload' -verbose -putthrottle:0 -summary:'Clearing out old batched upload to save up disk space for Swift migration'

2016-01-19

2016-01-17

2016-01-16

2016-01-15

  • 12:17 hashar: restarting Jenkins for plugins updates
  • 02:49 bd808: Trying to fix submodules in deployment-bastion:/srv/mediawiki-staging/php-master/extensions for T123701

2016-01-14

2016-01-13

  • 21:06 hashar: beta cluster code is up to date again. Got delayed by roughly 4 hours.
  • 20:55 hashar: unlocked Jenkins jobs for beta cluster by disabling/reenabling Jenkins Gearman client
  • 10:15 hashar: beta: fixed puppet on deployment-elastic06 . Was still using cert/hostname without .deployment-prep. .... Mass update occurring.

2016-01-12

2016-01-11

  • 22:24 hashar: Deleting old references on Zuul-merger for mediawiki/core : /usr/share/python/zuul/bin/python /home/hashar/zuul-clear-refs.py --until 15 /srv/ssd/zuul/git/mediawiki/core
  • 22:21 hashar: gallium in /srv/ssd/zuul/git/mediawiki/core$ git gc --prune=all && git remote update --prune
  • 22:21 hashar: scandium in /srv/ssd/zuul/git/mediawiki/core$ git gc --prune=all && git remote update --prune
  • 07:35 legoktm: deploying https://gerrit.wikimedia.org/r/263319

2016-01-07

2016-01-06

  • 21:13 thcipriani: kicking integration puppetmaster, weird node unable to find definition.
  • 21:11 jzerebecki: on scandium: sudo -u zuul rm -rf /srv/ssd/zuul/git/mediawiki/services/mathoid
  • 21:04 legoktm: ^ on gallium
  • 21:04 legoktm: manually deleted /srv/ssd/zuul/git/mediawiki/services/mathoid to force zuul to re-clone it
  • 20:17 hashar: beta: dropped a few more /etc/apt/apt.conf.d/*-proxy files. webproxy is no longer reachable from labs
  • 09:44 hashar: CI/beta: deleting all git tags from /var/lib/git/operations/puppet and doing git repack
  • 09:39 hashar: restoring puppet hacks on beta cluster puppetmaster.
  • 09:35 hashar: beta/CI: salt -v '*' cmd.run 'rm -v /etc/apt/apt.conf.d/*-proxy' https://phabricator.wikimedia.org/T122953

2016-01-05

2016-01-04

2016-01-02

  • 03:17 yurik: purged varnishes on deployment-cache-text04

2016-01-01

  • 22:17 bd808: No nodepool ci-jessie-* hosts seen in Jenkins interface and rake-jessie jobs backing up

Archives