Release Engineering/SAL/Archive 2

2016-12-27

05:00 Amir1: deploying 5230e7d in ores beta node (T154168)

2016-12-26

12:09 hashar: beta: restarted varnish.service and varnish-frontend.service on deployment-cache-text04

2016-12-24

09:02 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/329038

2016-12-23

12:18 legoktm: deploying https://gerrit.wikimedia.org/r/328886

2016-12-22

22:11 thcipriani: disable production l10nupdate for deployment freeze

2016-12-21

05:57 Krinkle: Jenkins "Collapsing Console Sections" for PHPUnit was broken since "-d zend.enable_gc=0" was added to phpunit.php invocation. Updated pattern in Jenkins system configuration.

2016-12-19

21:21 andrewbogott: and also python-functools32_3.2.3.2-3~bpo8+1_all.deb
21:20 andrewbogott: upgrading to python-jsonschema_2.5.1-5~bpo8+1_all.deb on deployment-eventlogging03
20:51 andrewbogott: upgrading to python-requests_2.12.3-1_all.deb ./python-urllib3_1.19.1-1_all.deb on deployment-mediawiki04 and deployment-tin
09:35 legoktm: deploying https://gerrit.wikimedia.org/r/328145
08:00 legoktm: deploying https://gerrit.wikimedia.org/r/288819 https://gerrit.wikimedia.org/r/276065 https://gerrit.wikimedia.org/r/328136
02:25 legoktm: deploying https://gerrit.wikimedia.org/r/327692

2016-12-16

22:34 legoktm: deploying https://gerrit.wikimedia.org/r/327202
14:33 hashar: Nodepool Image ci-jessie-wikimedia-1481897950 in wmflabs-eqiad is ready
14:25 hashar: Nodepool Image ci-trusty-wikimedia-1481897961 in wmflabs-eqiad is ready
14:19 hashar: Refreshing Nodepool images. The snapshots were broken due to mariadb-client failing to upgrade
13:45 hashar: integration / contintcloud : remove security rules of labs projects that allowed gallium (phased out) T95757
13:44 hashar: integration / contintcloud : update security rules of labs projects to allow contint2001
13:15 hashar: integration: update sudo policy for debian-glue to keep the env variable SHELL_ON_FAILURE (for https://gerrit.wikimedia.org/r/#/c/327720/ )
10:15 hashar: integration: apt-get upgrade on all permanent slaves
10:13 hashar: integration-slave-docker-1000 changed docker::version from the no longer existent '1.12.3-0~jessie' to simply 'present'. Will have to manually upgrade it from now on. T153419
10:04 hashar: deployment-puppetmaster02 updated puppet repo. It was stalled due to a bump of the mariadb submodule
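
The numeric suffix of the Nodepool image names logged above (for example ci-jessie-wikimedia-1481897950) looks like a Unix timestamp. The sketch below decodes it under that assumption. The function name `nodepool_image_time` is hypothetical and not part of Nodepool, and the `-d "@<epoch>"` form assumes GNU date.

```shell
# Hypothetical helper (not part of Nodepool): print the UTC time encoded in
# the numeric suffix of an image name such as ci-jessie-wikimedia-1481897950.
# Assumes the suffix is a Unix timestamp and that GNU date is available.
nodepool_image_time() {
    epoch="${1##*-}"   # keep everything after the last hyphen
    date -u -d "@${epoch}" '+%Y-%m-%d %H:%M:%S UTC'
}

nodepool_image_time ci-jessie-wikimedia-1481897950   # prints 2016-12-16 14:19:10 UTC
```

The decoded time matches the 14:19 entry about refreshing the Nodepool images, which suggests the suffix records when the image build started rather than when the image became ready.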

2016-12-15

21:00 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/324368
19:23 marxarelli: Manually rebasing and re-applying cherry picks for operations/puppet on integration-puppetmaster01.eqiad.wmflabs
16:08 hashar: deployment-phab02 : apt-get upgrade T147818
14:48 Amir1: ladsgroup@deployment-tin:~$ mwscript updateCollation.php --wiki=fawiki (T139110)
11:41 zeljkof: Reloading Zuul to deploy 327473

2016-12-14

12:38 elukey: created deployment-copper on deployment-prep as temporary test

2016-12-13

22:52 thcipriani: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/327119
21:15 thcipriani: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/327048
09:42 hashar: Updating MediaWiki Jenkins jobs to support injecting skin dependencies T151593
02:17 legoktm: deploying https://gerrit.wikimedia.org/r/326880
02:10 legoktm: deploying https://gerrit.wikimedia.org/r/326877

2016-12-09

04:01 legoktm: deploying https://gerrit.wikimedia.org/r/326070
03:45 legoktm: deploying https://gerrit.wikimedia.org/r/326069

2016-12-08

23:35 legoktm: deploying https://gerrit.wikimedia.org/r/326048 https://gerrit.wikimedia.org/r/326050
22:32 legoktm: deploying https://gerrit.wikimedia.org/r/325930
21:14 legoktm: deploying https://gerrit.wikimedia.org/r/326032
21:08 legoktm: deploying https://gerrit.wikimedia.org/r/326020
20:27 legoktm: deploying https://gerrit.wikimedia.org/r/325974
20:19 legoktm: deploying https://gerrit.wikimedia.org/r/326016
20:11 legoktm: deploying https://gerrit.wikimedia.org/r/326015
19:51 legoktm: deploying https://gerrit.wikimedia.org/r/326009
19:44 legoktm: deploying https://gerrit.wikimedia.org/r/325912 https://gerrit.wikimedia.org/r/326006
15:33 hashar: Image ci-jessie-wikimedia-1481210905 in wmflabs-eqiad is ready : Notice: /Stage[main]/Main/Package[netcat-openbsd]/ensure: ensure changed 'purged' to 'present'
15:28 hashar: Updating Nodepool Jessie image to ship `netcat` T151469 T152684
10:31 hashar: Image ci-trusty-wikimedia-1481192772 in wmflabs-eqiad is ready
10:21 hashar: Refreshing Nodepool base image for Trusty. Was blocked on a mariadb upgrade, should also acquire network faster T113342
09:45 legoktm: deploying https://gerrit.wikimedia.org/r/325903
08:48 hashar: Image ci-jessie-wikimedia-1481186016 in wmflabs-eqiad is ready T113342
05:31 legoktm: legoktm@integration-saltmaster:~$ sudo salt '*jessie*' cmd.run 'puppet agent -tv'
05:26 legoktm: cherry-picked https://gerrit.wikimedia.org/r/#/c/325877/ onto integration-puppetmaster01
03:26 legoktm: deploying https://gerrit.wikimedia.org/r/325873
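
Many entries end with one or more task IDs such as T151469. A minimal sketch for turning those into links, assuming the IDs refer to tasks on phabricator.wikimedia.org (the host is not stated in this log) and using a hypothetical function name `phab_task_urls`:

```shell
# Hypothetical helper: read log lines on stdin and print one Phabricator URL
# per task ID (T followed by digits) found in them.
# Assumption: tasks live at https://phabricator.wikimedia.org/T<id>.
phab_task_urls() {
    # -o prints only the matches, -w requires whole-word matches,
    # so "T151469" matches but a longer token such as "XT151469" does not.
    grep -owE 'T[0-9]+' | sed 's|^|https://phabricator.wikimedia.org/|'
}

printf '%s\n' '15:28 hashar: Updating Nodepool Jessie image to ship netcat T151469 T152684' | phab_task_urls
```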

2016-12-07

15:04 hashar: Image ci-trusty-wikimedia-1481122712 in wmflabs-eqiad is ready T117418
02:29 matt_flaschen: foreachwikiindblist FlowFixInconsistentBoards complete
02:27 matt_flaschen: Started (foreachwikiindblist flow.dblist extensions/Flow/maintenance/FlowFixInconsistentBoards.php) 2>&1 | tee FlowFixInconsistentBoards_2016-12-06.txt on deployment-tin

2016-12-06

21:20 hashar: Image ci-jessie-wikimedia-1481058839 in wmflabs-eqiad is ready T113342
21:13 hashar: Refresh Nodepool Jessie snapshot, which boots 3 times faster. Will help get nodes available faster T113342
16:33 hashar: Nodepool imported a new Jessie image 'jessie-T113342' with some network configuration hotfix. Will use for debugging. T113342
09:08 Reedy: running foreachwiki update.php on beta

2016-12-05

20:43 hashar: Image ci-jessie-wikimedia-1480969940 in wmflabs-eqiad is ready (includes trendingedits::packages, which explicitly defines the installation of librdkafka-dev)
09:52 elukey: add https://gerrit.wikimedia.org/r/#/c/324642/ to the deployment-prep's puppet master to test nutcracker
09:39 hashar: beta-update-databases-eqiad fails due to CONTENT_MODEL_FLOW_BOARD not registered on the wiki. T152379
08:44 hashar: Image ci-jessie-wikimedia-1480926961 in wmflabs-eqiad is ready T113342
08:35 hashar: Pushing new Jessie image to Nodepool that supposedly boots 3x faster T113342

2016-12-04

15:25 Krenair: Found a git-sync-upstream cron on deployment-mx for some reason... commented for now, but wtf was this doing on a MX server?

2016-12-03

23:07 legoktm: deploying https://gerrit.wikimedia.org/r/325132
10:48 legoktm: deploying https://gerrit.wikimedia.org/r/325093 and https://gerrit.wikimedia.org/r/325094

2016-12-02

14:40 hashar: added Tobias Gritschacher to Gerrit "integration" group so he can +2 patches on integration/* repositories \O/

2016-12-01

18:20 elukey: removing https://gerrit.wikimedia.org/r/#/c/305536 from the puppet master via rebase -i (no-op for beta)
18:11 elukey: adding https://gerrit.wikimedia.org/r/#/c/305536/3 to the puppet master
14:16 hashar: Image ci-jessie-wikimedia-1480601060 in wmflabs-eqiad is ready | T152096

2016-11-30

17:22 gehel: restart of logstash on deployment-logstash2 - upgrade to Java 8 - T151325
17:11 gehel: rolling restart of deployment-elastic0* - upgrade to Java 8 - T151325
11:22 hashar: Gerrit hide mediawiki/extensions/JsonData/JsonSchema Empty since 2013
11:20 hashar: Gerrit made mediawiki/extensions/GuidedTour/guiders read-only (per README.md, no longer used)
11:18 hashar: Gerrit mediawiki/extensions/CentralNotice/BannerProxy.git Empty since 2014

2016-11-29

15:23 hashar: Image ci-jessie-wikimedia-1480432368 in wmflabs-eqiad is ready
14:30 hashar: Image ci-trusty-wikimedia-1480429423 in wmflabs-eqiad is ready T151879
14:24 hashar: Refreshing Nodepool Trusty snapshot to get php5-xsl installed T151879

2016-11-28

09:48 hashar: Image ci-trusty-wikimedia-1480326016 in wmflabs-eqiad is ready
09:39 hashar: Regenerated Nodepool image for Trusty. It no longer includes apache::mod::php5 which broke the build and is not needed on Trusty ( https://gerrit.wikimedia.org/r/323803 )
09:15 elukey: cherry-pick of https://gerrit.wikimedia.org/r/#/c/323517 to deployment-puppetmaster02 to test

2016-11-26

16:15 Reedy: killed /srv/jenkins-workspace/workspace/mediawiki-core-*/src and /srv/jenkins-workspace/workspace/mwext-*/src from integration slaves to get rid of borked MW dirs
15:51 Reedy: deleted /srv/jenkins-workspace/workspace/mediawiki-core-code-coverage/src on integration-slave-trusty-1006 to force a reclone
14:14 Reedy: moved old /srv/mediawiki-staging/php-master to /tmp/php-master, recloned MW Core, copied in LocalSettings, skins, vendor and extensions. T151676. scap sync-dir running
13:05 Reedy: marked deployment-tin as offline due to T151670

2016-11-24

20:49 hashar: make the contint1001 Jenkins slave only build jobs with a label matching the node https://integration.wikimedia.org/ci/computer/contint1001/configure T86659
15:46 elukey: removing https://gerrit.wikimedia.org/r/#/c/322268/ from the list of cherry picks on puppet master since it is not the right way to go
08:58 elukey: rebased puppet operations git repo on deployment-puppetmaster to refresh https://gerrit.wikimedia.org/r/#/c/322268/

2016-11-23

15:04 Krenair: fixed puppet on deployment-cache-text04 by manually enabling experimental apt repo, see T150660
10:57 hashar: Terminating deployment-apertium01 again T147210

2016-11-22

19:31 hashar: beta: rebased puppet master
19:30 hashar: beta: dropping cherry pick for the PDF render by mobrovac ( https://gerrit.wikimedia.org/r/#/c/305256/ ). Got merged
08:29 hashar: Deleting shut off instances: integration-puppetmaster , deployment-puppetmaster , deployment-pdf02 , deployment-conftool - T150339

2016-11-21

12:46 hashar: beta: Cherry picked puppet fix for udp2log https://gerrit.wikimedia.org/r/#/c/322639/ T151169

2016-11-19

00:10 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/322370

2016-11-18

15:42 elukey: cherry picked https://gerrit.wikimedia.org/r/#/c/322268 on puppet master

2016-11-17

22:07 mutante: re-enabled puppet on contint1001 after live Apache fix
11:34 hasharLunch: Deleted instance deployment-apertium01 . Was Trusty and lacked packages, replaced by a Jessie one ages ago. T147210

2016-11-16

20:53 elukey: restored apache2 config on deployment-mediawiki06
20:28 elukey: temporarily increasing verbosity of mod_rewrite on deployment-mediawiki06 as test
20:02 Krenair: mysql master back up, root identity is now unix socket based rather than password
19:57 Krenair: taking mysql master down to fix perms
13:02 hashar: Restarted HHVM on deployment-mediawiki05; it was not honoring requests T150849
12:24 hashar: beta: created dewiktionary table on the Database slave. Restarted replication with START SLAVE; T150834 T150764
10:39 hashar: Removing revert b47ce21 from deployment-tin and reenabling jenkins job. https://gerrit.wikimedia.org/r/321857 will get it fixed
10:26 hashar: Reverting mediawiki/core b47ce21 on beta cluster T150833
09:51 hashar: marking deployment-tin offline so I can live hack mediawiki code / scap for T150833 and T15034
09:12 hashar: deployment-mediawiki04 stopping hhvm
08:59 hashar: beta database update broken with: MediaWiki 1.29.0-alpha Updater\n\nYour composer.lock file is up to date with current dependencies!
07:52 Krenair: the new mysql root password for -db04 is at /tmp/newmysqlpass as well as in a new file in the puppetmaster's labs/private.git
06:34 twentyafterfour: restarting hhvm on deployment-mediawiki04
06:33 Amir1: ladsgroup@deployment-mediawiki05:~$ sudo service hhvm restart
06:30 mutante: restarting hhvm on deployment-mediawiki06

2016-11-15

16:03 hasharAway: adding thcipriani to the labs "git" project maintained by paladox

2016-11-14

08:16 Amir1: cherry-picking 321096/3 in beta puppetmaster

2016-11-12

14:02 Amir1: cherry-picked gerrit change 321096/2 in puppetmaster
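
Gerrit changes are referenced in several styles across this log: full URLs (https://gerrit.wikimedia.org/r/329038), hash-fragment URLs with an optional patchset (https://gerrit.wikimedia.org/r/#/c/305536/3), and bare numbers (327473, 321096/2). The sketch below reduces all of them to the change number so entries can be compared. The function name `gerrit_change_number` is hypothetical and only these styles are handled.

```shell
# Hypothetical helper: reduce any of the Gerrit reference styles used in this
# log to the bare change number.
gerrit_change_number() {
    # 1. drop the URL prefix if present
    # 2. drop the "#/c/" segment of the hash-fragment form if present
    # 3. drop a trailing slash or "/<patchset>"
    printf '%s\n' "$1" | sed -E \
        -e 's|^https://gerrit\.wikimedia\.org/r/||' \
        -e 's|^#/c/||' \
        -e 's|/.*$||'
}

gerrit_change_number 'https://gerrit.wikimedia.org/r/#/c/305536/3'   # prints 305536
gerrit_change_number '321096/2'                                      # prints 321096
```

Prefixing the result with https://gerrit.wikimedia.org/r/ gives the short URL form already used by most entries.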

2016-11-11

23:48 bd808: Updated _template/logstash on deployment-logstash2 to include change from https://gerrit.wikimedia.org/r/#/c/320441/
23:44 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/320441/ for testing on deployment-logstash2
21:27 hashar: deployment-tin deleted /var/lock/scap . Was left over after beta-scap-eqiad job got abruptly aborted

2016-11-10

09:33 hashar: Image ci-jessie-wikimedia-1478770026 in wmflabs-eqiad is ready
09:26 hashar: Regenerate Nodepool base image for Jessie and refreshing snapshot image

2016-11-09

20:27 Krenair: removed default SSH access from production host 208.80.154.135, the old gallium IP
16:34 Reedy: deployment-tin no longer offline, jenkins running jobs now
16:11 Reedy: marking deployment-tin.eqiad as offline to test -labs -> beta config rename

2016-11-08

10:23 hashar: refreshing all jenkins jobs to clear out potential live hack I made but can't remember on which jobs I did

2016-11-07

14:01 gilles: Pointing deployment-imagescaler01.eqiad.wmflabs' puppet to puppetmaster.thumbor.eqiad.wmflabs

2016-11-04

13:20 hashar: gerrit: created mediawiki/extensions/PageViewInfo.git and renamed user group extension-WikimediaPageViewInfo to extension-PageViewInfo T148775
12:57 hashar: Image ci-jessie-wikimedia-1478263647 in wmflabs-eqiad is ready (bring in java for maven projects)
12:49 dcausse: deployment-prep reloading nginx on deployment-elastic0[5-7] to fix ssl cert issue
09:28 hashar: Delete integration-slave-jessie-1003 , only have a few jobs running on permanent Jessie slaves - T148183
09:26 hashar: Delete zuul-dev-jessie.integration.eqiad.wmflabs was for testing Zuul on Jessie and it works just fine on contint1001 :] T148183
09:25 hashar: Delete integration-slave-trusty-1012 one less permanent slave since some load has been moved to Nodepool T148183
09:24 hashar: Delete integration-slave-trusty-1016 not pooled in Jenkins anymore T148183

2016-11-03

15:05 Amir1: deploy 0caa589 in ores to deployment-sca03
14:52 Amir1: deploying ores 0caa589 in deployment-sca03
11:32 hashar: deployment-apertium01 manually cleared puppet.conf
11:29 hashar: deployment-apertium01 fails puppet due to wrong certificate bah
07:22 Krenair: fiddled with jenkins jobs in mediawiki-core-doxygen-publish to try to get stuff moving in the postmerge queue again
05:04 Krenair: beginning to move the rest of beta to the new puppetmaster
01:53 mutante: followed instructions at https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Gearman_deadlock
01:53 mutante: disabling and re-enabling gearman, zuul is not working and could be gearman deadlock

2016-11-02

22:06 hashar: hello stashbot
18:51 Krenair: armed keyholder on -tin and -mira
18:50 Krenair: started mysql on -db boxes to bring beta back online
10:54 hashar: Image ci-jessie-wikimedia-1478083637 in wmflabs-eqiad is ready
10:47 hashar: Force refresh Nodepool snapshot for Jessie so it gets doxygen included T119140

2016-11-01

22:22 Krenair: started mysql on -db03 to hopefully pull us out of read-only mode
22:21 Krenair: started mysql on -db04
22:19 Krenair: stopped and started udp2log-mw on -fluorine02
22:10 hashar: Armed keyholder on deployment-tin . Instance had 20 minutes uptime and apparently keyholder does not self arm
22:00 Krenair: started moving nodes back to the new puppetmaster
02:55 Krenair: Managed to mess up the deployment-puppetmaster02 cert, had to move those nodes back

2016-10-31

20:57 Krenair: moving some nodes to deployment-puppetmaster02
16:57 bd808: Added Niharika29 as project member

2016-10-27

20:51 hashar: reboot integration-puppetmaster01
18:50 bd808: stashbot has replaced qa-morebots in this channel as the sole bot handling !log messages
18:46 bd808: Testing dual page wiki logging by stashbot. (check #3)
18:36 bd808: !log deployment-prep Testing dual page wiki logging by stashbot. (second attempt)
18:14 bd808: !log deployment-prep Testing dual page wiki logging by stashbot.
10:30 hashar: integration: on Trusty slaves, remove jenkins-deploy from KVM which is only needed for Android testing for T149294: salt -v '*slave-trusty*' cmd.run 'deluser jenkins-deploy kvm'
10:29 hashar: integration: on Trusty slaves, remove jenkins-deploy from KVM which is only needed for Android testing: salt -v '*slave-trusty*' cmd.run 'groupdeluser jenkins-deploy kvm'
10:25 hashar: integration: purge Android packages from Trusty slaves for T149294 : salt -v '*slave-trusty*' cmd.run 'apt-get --yes remove --purge gcc-multilib lib32z1 lib32stdc++6 qemu'

2016-10-25

19:21 hasharAway: Python PyPi mirror has some issue. Impacts all CI jobs relying on tox https://status.python.org/
10:39 elukey: cherry picked https://gerrit.wikimedia.org/r/#/c/314519/ and https://gerrit.wikimedia.org/r/#/c/306943/ to deployment-puppetmaster

2016-10-24

16:19 andrewbogott: upgrading deployment-puppetmaster to puppet 3.8.5 packages
09:14 hashar: rebasing integration puppet master

2016-10-21

09:42 gehel: decommission of deployment-elastic08 - T147777

2016-10-20

23:37 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/317083
20:53 legoktm: deploying https://gerrit.wikimedia.org/r/317022

2016-10-14

21:13 matt_flaschen: Ran START SLAVE to restart replication after columns created directly on replica were deleted.
20:53 bd808: Dropped lu_local_id, lu_global_id from replica db which were added improperly
20:37 matt_flaschen: Applied CentralAuth's patch-lu_local_id.sql migration for T148111, to sql --write
20:09 bd808: Applied CentralAuth's patch-lu_local_id.sql migration for T148111
11:30 dcausse: deployment-prep running sudo update-ca-certificates --fresh on deployment-ton to fix curl error code 60 in cirrus maint script (T145609)

2016-10-13

21:21 hashar: Deleted CI slaves integration-slave-jessie-1004 integration-slave-jessie-1005 integration-slave-trusty-1013 integration-slave-trusty-1014 integration-slave-trusty-1017 integration-slave-trusty-1018
20:12 hashar: Switching composer-hhvm / composer-php55 to Nodepool https://gerrit.wikimedia.org/r/#/c/306727/ T143938
16:23 gilles: Resetting to 61a9cd1f47c5aec8ded92f2486ce43309b9e3e03 on deployment-puppetmaster
16:06 godog: add settings to duplicate traffic to thumbor in beta and restart swift-proxy
16:03 gilles: Cherry-picking https://gerrit.wikimedia.org/r/#/c/315648/ on deployment-puppetmaster
15:35 gilles: Resetting to 61a9cd1f47c5aec8ded92f2486ce43309b9e3e03 on deployment-puppetmaster
14:38 gilles: Cherry-picking https://gerrit.wikimedia.org/r/#/c/315234/5 on deployment-puppetmaster
14:34 gilles: Resetting to 61a9cd1f47c5aec8ded92f2486ce43309b9e3e03 on deployment-puppetmaster
14:32 gilles: Cherry-picking https://gerrit.wikimedia.org/r/#/c/315234/4 on deployment-puppetmaster
14:32 gilles: Resetting to 61a9cd1f47c5aec8ded92f2486ce43309b9e3e03 on deployment-puppetmaster
14:27 gilles: Cherry-picking https://gerrit.wikimedia.org/r/#/c/315234/ on deployment-puppetmaster
14:22 gilles: Resetting to 61a9cd1f47c5aec8ded92f2486ce43309b9e3e03 on deployment-puppetmaster
13:42 gilles: Cherry picking https://gerrit.wikimedia.org/r/#/c/315248/ on deployment-puppetmaster

2016-10-12

13:37 elukey: upgraded memcached on deployment-memc04 to 1.4.28-1.1+wmf1 as part of a perf experiment (T129963) - rollback: wipe https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep/host/deployment-memc04, apt-get remove memcached on deployment-memc04, puppet run

2016-10-11

21:35 hasharAway: Force pushed Zuul patchqueue 5628f95...fc6a118 HEAD -> patch-queue/debian/precise-wikimedia
14:37 hashar: Mysql was down on Precise slaves. Apparently rebooted 17 days ago and I guess mysql does not spawn on boot. Restarted mysql on all Precise via: salt -v '*slave-precise*' cmd.run 'start mysql'
09:35 godog: reboot deployment-imagescaler01 to enable memory cgroup
08:29 hashar: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/#/c/313387/ Filter out refs/meta/config from all pipelines T52389

2016-10-10

15:45 dcausse: deployment-prep deployment-elastic0[5-8]: reduce the number of replicas to 1 max for all indices

2016-10-07

20:10 hashar: Created repository.integration.eqiad.wmflabs to play/Test Sonatype Nexus
20:10 hashar: rebooting integration-puppetmaster01
07:55 hashar: Upgrading Nodepool image for Jessie

2016-10-06

14:45 hashar: deployment-mira disarmed/rearmed keyholder in an attempt to clear a Shinken alarm
12:16 hashar: Jenkins slave deployment-tin.eqiad , removing label "deployment-tin.eqiad" it has "BetaClusterBastion" and all jobs are bound to it already

2016-10-05

19:33 andrewbogott: removing mediawiki::conftool from deployment-mediawiki04, deployment-mediawiki06, deployment-mediawiki05

2016-10-04

19:43 andrewbogott: removed contint::slave_scripts and associated files from deployment-sca01 and deployment-sca02
16:22 bd808: Restarted puppetmaster process on deployment-puppetmaster
16:20 bd808: deployment-puppetmaster: removing cherry-pick of https://gerrit.wikimedia.org/r/#/c/305256/; conflicts with upstream changes
15:01 godog: shutdown deployment-poolcounter02, replaced by deployment-poolcounter04 - T123734
09:03 hashar: Regenerating configuration of all Jenkins job due to https://gerrit.wikimedia.org/r/#/c/313306/
01:14 twentyafterfour: New scap command line autocompletions are now installed on deployment-tin and deployment-mira refs T142880

2016-10-03

22:40 thcipriani: manual rebase on deployment-puppetmaster:/var/lib/git/operations/puppet
22:05 thcipriani: reapplied beta::deployaccess to mediawiki servers
21:42 cscott: updated OCG to version 0bf27e3452dfdc770317f15793e93e6e89c7865a
21:36 cscott: starting OCG deploy
13:43 hashar: Added integration-slave-trusty-1014 back in the pool
13:41 hashar: Tip of the day: to reboot an instance and bypass molly-guard: /sbin/reboot
13:39 hashar: integration-slave-trusty-1014 upgrading packages, clean up and rebooting it
13:37 hashar: marked integration-slave-trusty-1014 offline. Can't run jobs / gets stuck somehow
10:21 godog: add role::prometheus::node_exporter to classes in hiera:deployment-prep T144502

2016-10-01

09:41 hashar: beta: shutdown deployment-db1 and deployment-db2 . Databases have been migrated to other hosts T138778

2016-09-29

15:43 hashar: logstash-beta: refreshed the field list via https://logstash-beta.wmflabs.org/app/kibana#/settings/indices/logstash
13:52 hashar: Restarted jobrunner / jobchron on deployment-jobrunner02 . They were no longer logging to /var/log/mediawiki/ somehow
13:51 hashar: Restarted udp2log on deployment-fluorine02
10:50 legoktm: deploying https://gerrit.wikimedia.org/r/313384
10:37 hashar: Jenkins upgrade AnsiColor plugin from 0.3.1 to 0.4.2
10:28 hashar: Upgrading Jenkins plugins with zeljkof :]
08:59 hashar: Hopefully going to get beta fixed via mw/core revert patch https://gerrit.wikimedia.org/r/313373

2016-09-28

23:56 MaxSem: Deleted varnish cache files on deployment-cache-upload04 to free up space, disk full
21:48 hasharAway: deployment-tin: service nscd restart
21:43 hasharAway: beta cluster update database is broken :/ Filed T146947 about it
21:25 hasharAway: deployment-tin: sudo -H -u www-data php5 /srv/mediawiki-staging/multiversion/MWScript.php update.php --wiki=commonswiki --quick
21:18 hasharAway: https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/ is broken for unknown reason :(
20:48 hasharAway: Deleted deployment-tin02 via Horizon. Replaced by deployment-tin
20:19 hasharAway: restarted keyholder on deployment-tin
20:11 hasharAway: Switch Jenkins slave deployment-mira.eqiad to deployment-tin.eqiad
20:09 hasharAway: deployment-tin: keyholder arm
20:08 hasharAway: deployment-tin for instance in `grep deployment /etc/dsh/group/mediawiki-installation`; do ssh-keyscan `dig +short $instance` >> /etc/ssh/ssh_known_hosts; done;
19:49 hasharAway: Dropping deployment-tin02 , replacing it with deployment-tin which has been rebuilt to Jessie T144006
12:44 hashar: Can't finish up the switch to deployment-tin, puppet still does not pass due to weird clone issues ...
11:48 hashar: Deleting deployment-tin Trusty instance and recreate one with same hostname as Jessie; Meant to replace deployment-tin02 T144006
10:44 hashar: CI updating all mwext-Wikibase* jenkins jobs for https://gerrit.wikimedia.org/r/#/c/313056/ T142158
10:43 hashar: Updating slave scripts for "Disable garbage collection for mw-phpunit.sh" https://gerrit.wikimedia.org/r/313051 T142158
08:31 hashar: Reloading Zuul to deploy dc2ada37

2016-09-27

20:11 hashar: Reloading Zuul to deploy 3c3289aa1a for T143938 and T146783
16:29 anomie: Cherry-picked https://gerrit.wikimedia.org/r/#/c/313035/ on deployment-puppetmaster

2016-09-26

23:58 bd808: Started udp2log-mw on deployment-fluorine02 for T146723
11:35 hashar: deployment-salt02 : autoremoving a bunch of java related packages
11:31 hashar: rebooting deployment-salt02 has a kernel soft lock while hitting the disk
11:24 hashar: beta: mass upgrading all debian packages on all instances
10:32 hashar: beta: on deployment-pdf01 rm -fR /home/cscott/tmp/npm*
10:29 hashar: deployment-pdf01 apt-get upgrade / cleaning files left over etc
10:28 hashar: beta: on deployment-pdf01 rm -fR /home/cscott/.npm/ T145343

2016-09-24

20:08 hashar: deployment-tin is shutdown. Replaced by Jessie deployment-tin02
20:02 hashar: deployment-mira: ssh-keyscan deployment-tin02.deployment-prep.eqiad.wmflabs >> /etc/ssh/ssh_known_hosts
20:00 hashar: beta: dropping deployment-tin (ubuntu) replaced by deployment-tin02 (jessie). Primary is still deployment-mira (https://gerrit.wikimedia.org/r/#/c/312654/ T144578 )

2016-09-23

20:21 hashar: integration: salt -v '*trusty*' cmd.run 'service mysql start'
20:00 hashar: rebooting all CI permanent slaves. Making sure nothing is left on /mnt (which is no longer mounted)
19:53 hashar: added a 30 minutes build timeout to https://integration.wikimedia.org/ci/job/phabricator-jessie-diffs/
15:02 hashar: rebooting integration-slave-jessie-1001
14:04 hashar: remove the /mnt based tmpfs for T146381 / https://gerrit.wikimedia.org/r/#/c/312518/ via: salt -v '*' cmd.run 'umount /mnt/home/jenkins-deploy/tmpfs'
13:41 hashar: Switching tmpfs from /mnt to /srv https://gerrit.wikimedia.org/r/#/c/312330/ and running fab deploy_slave_scripts

2016-09-22

19:29 hasharAway: switching Jenkins slaves workspace from /mnt/jenkins-workspace to /srv/jenkins-workspace (actually the same dir/inode on the filesystem)
01:52 legoktm: deploying https://gerrit.wikimedia.org/r/312158

2016-09-21

18:22 yuvipanda: shutting down integration-puppetmaster
17:26 yuvipanda: cherry-pick https://gerrit.wikimedia.org/r/#/c/312044/ on deployment-puppetmaster
16:41 hashar: deployment-tin02 initial provisioning is complete. Gotta add it as a deployment server via a puppet.git patch
16:01 hashar: deployment-tin02 applied puppet classes beta::autoupdater, beta::deployaccess, role::deployment::server, role::labs::lvm::srv
15:32 hashar: spawned deployment-tin02
14:55 hashar: removed the CI puppet class from deployment-sca01 and deployment-sca02 . Stopped services using /srv , unmounted /srv, removed it from /etc/fstab
14:27 hashar: deployment-sca01 and deployment-sca02 are now broken. The CI puppet class mounts /srv which ends up being only 500 MBytes
14:08 hashar: deployment-mira adding puppet class beta::autoupdater
14:06 hashar: Enabling Jenkins slave deployment-mira
14:05 hashar: deployment-mira seems ready for action and is the primary deployment server. Enabling jenkins to it
11:25 hashar: removing Jenkins slave deployment-tin , deployment-mira is the new deployment master T144578
10:58 hashar: Changing Jenkins slaves home dir for deployment-sca01 and deployment-sca02 from /mnt/home/jenkins-deploy to /srv/jenkins/home/jenkins-deploy
10:57 hashar: Changing Jenkins slaves home dir for deployment-tin and deployment-mira from /mnt/home/jenkins-deploy to /srv/jenkins/home/jenkins-deploy
10:10 hashar: deployment-mira removing "role::labs::lvm::srv" duplicate with role::ci::slave::labs::common
10:07 hashar: Making deployment-mira a Jenkins slave by applying puppet class role::ci::slave::labs::common T144578
10:05 hashar: Arming keyholder on deployment-mira
09:43 hashar: beta: switching master deployment server from deployment-tin to deployment-mira
09:34 hashar: From Hiera:deployment-prep remove bit already in puppet: "scap::deployment_server": deployment-tin.deployment-prep.eqiad.wmflabs
08:55 moritzm: remove mira from deployment-prep (replaced by deployment-mira)
08:37 hashar: beta: manually rebased puppetmaster
08:11 elukey: terminated jobrunner01 and removed from deployment-prep's scap dsh list
07:19 legoktm: deploying https://gerrit.wikimedia.org/r/311927

2016-09-20

21:49 hashar: Deleting deployment-mira02 /srv was too small. Replaced by deployment-mira
20:54 hashar: from deployment-tin for T144578, accept ssh host key of deployment-mira : sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mira.deployment-prep.eqiad.wmflabs
20:47 hashar: Creating deployment-mira instance with flavor c8.m8.s60 (8 cpu, 8G RAM and 60G disk) T144578
19:00 thcipriani: cherry-picked https://gerrit.wikimedia.org/r/#/c/311760/ to deployment-puppetmaster to fix failing beta-scap-eqiad job, had to manually start rsync, puppet failed to start
18:38 hashar: on tin: `sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mira02.deployment-prep.eqiad.wmflabs` - T144006
18:33 hashar: on deployment-mira02 ran `sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki04.deployment-prep.eqiad.wmflabs` per T144006
18:01 marxarelli: deployed mediawiki-config changes on beta cluster. back in read/write mode using new database instances
17:37 marxarelli: deployment-db04 restored from backup and replication started
16:54 marxarelli: upgraded package and data to mariadb 10 on deployment-db03
16:31 marxarelli: cherry picking operations/puppet patches (T138778) to deployment-puppetmaster
16:30 moritzm: rebooting deployment-mira02
16:23 marxarelli: applied innodb transaction logs to deployment-db1 backup and successfully restored on deployment-db03
15:47 marxarelli: completed innobackupex on deployment-db1. copying backup to deployment-db03 for restoration
14:54 hashar: beta: cherry picking fix up for the jobrunner logging https://gerrit.wikimedia.org/r/#/c/311702/ and https://gerrit.wikimedia.org/r/311719 T146040
14:44 marxarelli: entering read-only mode on beta cluster
14:27 elukey: stopped puppet, jobrunner and jobchron on deployment-jobrunner01
14:20 marxarelli: disabling beta cluster jenkins jobs in preparation for data migration (T138778)
13:07 godog: add deployment-prometheus01 instance T53497
11:20 elukey: applied beta::deployaccess, role::labs::lvm::srv, role::mediawiki::jobrunner to jobrunner02
10:45 elukey: created deployment-jobrunner02 in deployment-prep

2016-09-19

22:01 legoktm: shutdown integration-puppetmaster
21:29 yuvipanda: regenerated client certs only on integration-puppetmaster01, seems ok now
20:46 yuvipanda: re-enable puppet everywhere
20:43 yuvipanda: enable puppet and run on integration-slave-trusty-1003.eqiad.wmflabs
20:41 yuvipanda: accidentally deleted /var/lib/puppet/ssl on integration-puppetmaster01 as well, causing it to lose keys. Reprovision by pointing to labs puppetmaster
20:34 yuvipanda: rm -rf /var/lib/puppet/ssl on all integration nodes
20:34 yuvipanda: copied /etc/puppet/puppet.conf from integration-slave-trusty-1001 to all integration nodes
20:25 yuvipanda: delete /etc/puppet/puppet.conf.d/10-self.conf and /var/lib/puppet/ssl on integration-slave-trusty-1001
20:20 yuvipanda: re-enabled puppet on integration-slave-trusty-1001
20:08 yuvipanda: reset puppetmaster of integration-puppetmaster01 to be labs puppetmaster
20:03 yuvipanda: disable puppet across integration project, moving puppetmasters
19:49 legoktm: creating T144951 enabled role::puppetmaster::standalone role on integration-puppetmaster01
19:33 legoktm: creating T144951 integration-puppetmaster01 instance using m1.small and debian jessie
15:11 hashar: beta: updating jobrunner service 0dc341f..a0e8216

2016-09-17

07:11 legoktm: deploying https://gerrit.wikimedia.org/r/311024

2016-09-16

21:03 hashar: deployment-tin did a git gc on /srv/deployment/ores That freed up disk space and cleared an alarm on co master mira02
21:00 hashar: deleted deployment-parsoid05
20:52 hashar: fixed puppet on deployment-parsoid05 . Temporary instance; will delete it later to clear it out of shinken.wmflabs.org
20:27 hashar: beta: force running puppet in batches of 4 instances: salt --batch 4 -v 'deployment-*' cmd.run 'puppet agent -tv'
20:13 hashar: beta: restarted puppetmaster
20:07 hashar: beta: salt -v '*' cmd.run 'rm -fR /var/lib/puppet/client/ssl/'
20:07 hashar: beta: stopping puppetmaster, rm -f /var/lib/puppet/server/ssl/ca/signed/*
19:53 hashar: beta created instance "deployment-parsoid05" Should be deleted later, that is merely to purge the hostname from Shinken ( http://shinken.wmflabs.org/host/deployment-parsoid05 )
11:42 hashar: beta: apt-get upgrade on deployment-jobrunner01
11:36 hashar: apt-get upgrade on deployment-tin , bring in a new hhvm version and others
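
The 20:07 to 20:27 entries above describe a Puppet certificate reset on the beta cluster, logged newest first. A minimal Python sketch that lays the same steps out in the order they were run. The `rm` and `salt` commands are copied from the entries; the two `service puppetmaster` lines are an assumption, since the log only says the puppetmaster was stopped and restarted and does not record the commands used.

```python
def puppet_cert_reset_steps(batch_size=4):
    """Return the beta cluster Puppet cert reset steps in chronological order."""
    return [
        # Assumed command: the log only says "stopping puppetmaster".
        "service puppetmaster stop",
        # From the 20:07 entry: drop every certificate the master has signed.
        "rm -f /var/lib/puppet/server/ssl/ca/signed/*",
        # From the 20:07 entry: drop the client-side SSL state on every minion.
        "salt -v '*' cmd.run 'rm -fR /var/lib/puppet/client/ssl/'",
        # Assumed command: the log only says "restarted puppetmaster".
        "service puppetmaster start",
        # From the 20:27 entry: re-run the agents a few instances at a time.
        "salt --batch {} -v 'deployment-*' cmd.run 'puppet agent -tv'".format(batch_size),
    ]


if __name__ == "__main__":
    # Dry run: print the plan instead of executing anything.
    for number, command in enumerate(puppet_cert_reset_steps(), start=1):
        print("{}. {}".format(number, command))
```

The sketch only prints a plan; nothing is executed, so it is safe to run anywhere.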

2016-09-15

22:29 legoktm: sudo salt '*precise*' cmd.run 'service mysql start', all mysql's are down
16:45 godog: install xenial kernel on deployment-zotero01 and reboot T145793
16:18 hashar: prometheus enabled on all beta cluster instances. Does not support Precise hence puppet will fail on the last two Precise instances deployment-db1 and deployment-db2 until they are migrated to Jessie T138778
15:53 godog: add role::prometheus::node_exporter to classes in hiera:deployment-prep T144502
15:10 hashar: beta: Applying puppet class role::prometheus::node_exporter to mira02 just like mira. That is for godog
15:08 hashar: T144006 Disabled Jenkins job beta-scap-eqiad. On mira02 rm -fR /srv/* . Applying puppet for role::labs::lvm::srv
15:05 hashar: T144006 Applying class role::labs::lvm::srv to mira02 (it is out of disk space :D )
14:45 hashar: T144006 sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@mira02.deployment-prep.eqiad.wmflabs
14:44 hashar: T144006 sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki05.deployment-prep.eqiad.wmflabs
12:33 elukey: added base::firewall, beta::deployaccess, mediawiki::conftool, role::mediawiki::appserver to mediawiki05
12:20 elukey: terminate mediawiki02 to create mediawiki05
10:48 hashar: beta: cherry picking moritzm patch https://gerrit.wikimedia.org/r/#/c/310793/ "Also handle systemd in keyholder script" T144578
09:33 hashar: T144006 sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki06.deployment-prep.eqiad.wmflabs
09:10 elukey: executed git pull and then git rebase -i on deployment puppet master
08:52 elukey: terminated mediawiki03 and created mediawiki06
08:45 elukey: removed mediawiki03 from puppet with https://gerrit.wikimedia.org/r/#/c/310749/
02:36 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/310701

2016-09-14

21:37 hashar: integration: setting "ulimit -c 2097152" on all slaves due to Zend PHP segfaulting T142158
14:31 hashar: Added otto to integration labs project
13:28 gehel: upgrading deployment-logstash2 to elasticsearch 2.3.5 - T145404
09:27 hashar: Deleting deployment-mediawiki01 , replaced by deployment-mediawiki04 T144006
07:19 legoktm: sudo salt '*trusty*' cmd.run 'service mysql start', it was down on all trusty slaves
07:17 legoktm: mysql just died on a bunch of slaves (trusty-1013, 1012, 1001)

2016-09-13

17:02 marxarelli: re-enabling beta cluster jenkins jobs following maintenance window
16:59 marxarelli: aborting beta cluster db migration due to time constraints and ops outage. will reschedule
15:34 marxarelli: disabled beta jenkins builds while in maintenance mode
15:18 marxarelli: starting 2-hour read-only maintenance window for beta cluster migration
10:06 hashar: beta: manually updated jobrunner install on deployment-jobrunner01 and deployment-tmh01 then reloaded the services with: service jobchron reload
10:02 hashar: Trebuchet is broken for /srv/deployment/jobrunner/jobrunner , can't reach the deploy minions somehow. Did the update manually
10:00 hashar: Upgrading beta cluster jobrunner to catch up with upstream b952a7c..0dc341f merely picking up a trivial log change ( https://gerrit.wikimedia.org/r/#/c/297935/ )
09:40 hashar: Unpooled deployment-mediawiki01 from scap and varnish. Shutting down instance. T144006
09:02 hashar: on deployment-tin, accepted mediawiki04 host key for jenkins-deploy user : sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki04.deployment-prep.eqiad.wmflabs T144006
08:26 hashar: mwdeploy@deployment-mediawiki04 manually accepted ssh host key of deployment-tin T144006
08:17 hashar: beta: manually accepted ssh host key for deployment-mediawiki04 as user mwdeploy on deployment-tin and mira T144006
07:46 gehel: upgrading elasticsearch to 2.3.5 on deployment-elastic0? - T145404

2016-09-12

14:41 elukey: applied base::firewall, beta::deployaccess, mediawiki::conftool, role::mediawiki::appserver to deployment-mediawiki04.deployment-prep.eqiad.wmflabs (Debian jessie instance) - T144006
12:50 gehel: rolling back upgrading elasticsearch to 2.4.0 on deployment-elastic05 - T145058
12:03 gehel: upgrading elasticsearch to 2.4.0 on deployment-elastic0? - T145058
12:01 hashar: Gerrit: made analytics-wmde group to be owned by themselves
11:57 hashar: Gerrit: added ldap/wmde as an included group of the 'wikidata' group. Asked by and demoed to addshore

2016-09-11

18:45 legoktm: deploying https://gerrit.wikimedia.org/r/309829

2016-09-09

20:53 thcipriani: testing scap 3.2.5-1 on beta cluster
11:08 hashar: Added git tag for latest versions of mediawiki/selenium and mediawiki/ruby/api
09:30 legoktm: Image ci-jessie-wikimedia-1473412532 in wmflabs-eqiad is ready
08:53 legoktm: added phpflavor-php70 label to integration-slave-jessie-100[1-5]
08:49 legoktm: deploying https://gerrit.wikimedia.org/r/309048

2016-09-08

21:33 hashar: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/309413 " Inject PHP_BIN=php5 for php53 jobs"
20:00 hashar: nova delete ci-jessie-wikimedia-369422 (was stuck in deleting state)
19:49 hashar: Nodepool, deleting instances that Nodepool lost track of (from nodepool alien-list)
19:47 hashar: nodepool can't delete: ci-jessie-wikimedia-369422 [ delete | 2.24 hours . Stuck in task_state=deleting :(
19:46 hashar: Nodepool looping over some tasks since 17:45 ( https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=21&fullscreen )
19:26 legoktm: repooled integration-slave-jessie-1005 now that php7 testing is done
19:19 hashar: integration: salt -v '*' cmd.run 'cd /srv/deployment/integration/slave-scripts; git pull' | https://gerrit.wikimedia.org/r/308931
19:12 hashar: integration: salt -v '*' cmd.run 'cd /srv/deployment/integration/slave-scripts; git pull' | https://gerrit.wikimedia.org/r/309272
17:08 legoktm: deleted integration-jessie-lego-test01
16:50 legoktm: deleted integration-aptly01
10:03 hashar: Delete Jenkins job https://integration.wikimedia.org/ci/job/mwext-VisualEditor-sync-gerrit/ that has been left behind. It is no more needed. T51846 T86659
10:02 hashar: Delete mwext-VisualEditor-sync-gerrit job, already got removed by ostriches in 139d17c8f1c4bcf2bb761e13a6501e4d85684066 . The issue in Gerrit (T51846) has been fixed. Poke T86659 , one less job on slaves.

2016-09-07

20:44 matt_flaschen: Re-enabled beta-code-update-eqiad .
20:35 hashar: Updated security group for deployment-prep labs project. Allow ssh port 22 from contint1001.wikimedia.org (matching rules for gallium). T137323
20:30 hashar: Updated security group for contintcloud and integration labs project. Allow ssh port 22 from contint1001.wikimedia.org (matching rules for gallium). T137323
20:14 matt_flaschen: Temporarily disabled https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/ to test live revert of aa0f6ea
16:09 hashar: Nodepool back in action. Had to manually delete some instances in labs
15:58 hashar: Restarting Nodepool . Lost state when labnet got moved T144945
13:13 hashar: Image ci-jessie-wikimedia-1473253681 in wmflabs-eqiad is ready , has php7 packages. T144872
11:53 hashar: Force refreshing Nodepool jessie snapshot to get PHP7 included T144872
11:03 hashar: integration: cherry pick https://gerrit.wikimedia.org/r/#/c/308955/ "contint: prefer our bin/php alternative" T144872
10:55 hashar: integration: dropped PHP7 cherry pick from puppet master. https://gerrit.wikimedia.org/r/#/c/308918/ has been merged. Pushing it to the fleet of permanent Jessie slaves. T144872
10:37 hashar: beta: cleaning up salt-keys on deployment-salt02 . Bunch of instances got deleted
09:41 hashar: Moving rake jobs back to Nodepool ( T143938 ) with https://gerrit.wikimedia.org/r/#/c/306723/ and https://gerrit.wikimedia.org/r/#/c/306724/
05:57 legoktm: deploying https://gerrit.wikimedia.org/r/308932 https://gerrit.wikimedia.org/r/299697
05:26 legoktm: cherry-picked https://gerrit.wikimedia.org/r/#/c/308918/ onto integration-puppetmaster with a hack that has it only apply to integration-slave-jessie-1005
04:59 legoktm: added Krenair to integration project to help debug puppet stuff
04:35 legoktm: depooled integration-slave-jessie-1005 in jenkins so I can test puppet stuff on it

2016-09-06

13:58 hashar: Qunit jobs should be all fine again now. T144802
13:46 hashar: nodepool.SnapshotImageUpdater: Image ci-jessie-wikimedia-1473169259 in wmflabs-eqiad is ready T144802
13:20 hashar: Rebuilding Nodepool Jessie image to hopefully include libapache-mod-php5 and restore qunit jobs behavior T144802
10:37 hashar: gerrit: mark apps/android/commons hidden since it is now community maintained on GitHub. Will avoid confusion. T127678
09:11 hashar: nodepool.SnapshotImageUpdater: Image ci-trusty-wikimedia-1473152801 in wmflabs-eqiad is ready
09:06 hashar: nodepool.SnapshotImageUpdater: Image ci-jessie-wikimedia-1473152393 in wmflabs-eqiad is ready
09:00 hashar: Trying to refresh Nodepool Jessie image . Image properties have been dropped, should fix it

2016-09-05

14:08 hashar: Refreshing Nodepool base images for Trusty and Jessie. Managed to build new ones after T143769

2016-09-02

20:36 legoktm: deploying https://gerrit.wikimedia.org/r/308227
15:17 hashar: Bringing tox jobs to Nodepool with https://gerrit.wikimedia.org/r/#/c/306725/

2016-09-01

19:00 urandom: T130861: Restarting Cassandra on deployment-restbase0[1-2]
18:58 urandom: T130861: De-cherry-picking https://gerrit.wikimedia.org/r/#/c/282466/
18:34 urandom: T130861: Restarting Cassandra on deployment-restbase0[1-2]
18:32 urandom: T130861: Cherry picking https://gerrit.wikimedia.org/r/#/c/282466/ to deployment-puppetmaster
16:38 legoktm: deploying https://gerrit.wikimedia.org/r/307794
12:22 hashar: migrating deployment-tin keyholder to use base::service_unit for moritzm https://gerrit.wikimedia.org/r/#/c/307510/ + reboot + keyholder arm
03:09 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/307909

2016-08-31

23:40 bd808: forced puppet run on deployment-salt02. Had not run automatically for 8 hours
23:36 bd808: Deleted /data/scratch on integration-slave-trusty-1016 to fix puppet
23:32 bd808: Deleted /data/scratch on integration-slave-trusty-1013 to fix puppet
23:22 bd808: Deleted /data/scratch on integration-slave-trusty-1012 to fix puppet
23:19 bd808: Deleted /data/scratch on integration-slave-trusty-1011 to fix puppet
23:15 bd808: Deleted /data/scratch on integration-slave-precise-1012 to fix puppet
23:11 bd808: Deleted /data on integration-slave-precise-1011 to fix puppet
23:08 bd808: Deleted /data on integration-slave-jessie-1001 to fix puppet
23:04 bd808: Deleted empty /data, /data/project, and /data/scratch on integration-puppetmaster to fix puppet
22:59 bd808: Deleted empty /data, /data/project, and /data/scratch on integration-publisher to fix puppet
01:44 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/307670

2016-08-30

23:31 yuvipanda: cherry-picking https://gerrit.wikimedia.org/r/#/c/307656/ fixed puppet on the elasticsearch machines!
22:29 yuvipanda: in lieu of blood sacrifice, restart puppetmaster on deployment-puppetmaster
21:44 yuvipanda: use clush to fix puppet.conf of all clients, realize also accidentally set a client's puppet.conf for the server, recover server's old conf file from a cat in shell history, restore, breathe sigh of relief
21:37 yuvipanda: sudo takes like 15s each time, is there no god?
21:36 yuvipanda: managed to get vim into a state where I can not quit it, probably recording a macro. I hate computers
21:16 yuvipanda: deployment-pdf01 fixed manually
21:15 yuvipanda: deployment-pdf02 has proper ssl certs mysteriously without me doing anything
21:06 yuvipanda: moved deployment-db[12], deployment-stream to not use role::puppet::self, attempting to semi-automate rest
20:52 yuvipanda: cherry-picked appropriate patch on deployment-puppetmaster for T120159, did https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep/host/deployment-puppetmaster&oldid=818847 to make sure the puppetmaster allows connections from elsewhere
19:48 legoktm: deploying https://gerrit.wikimedia.org/r/306710
19:13 bd808: Fixed puppet runs on deployment-sca0[12] with cherry-pick of https://gerrit.wikimedia.org/r/#/c/307561
18:57 bd808: Duplicate declaration: File[/srv/deployment] is already declared in file /etc/puppet/modules/contint/manifests/deployment_dir.pp:14; cannot redeclare at /etc/puppet/modules/service/manifests/deploy/common.pp:12 on node deployment-sca01.deployment-prep.eqiad.wmflabs
18:40 bd808: Puppet busted on deployment-aqs01 -- Could not find data item analytics_hadoop_hosts in any Hiera data file and no default supplied at /etc/puppet/manifests/role/aqs.pp:46
12:59 hashar: beta: revert master branch to origin. Ran scap and enabled again beta-code-update-eqiad job.
12:55 hashar: Running scap on beta cluster via https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/117786/console T143889
12:53 hashar: Cherry picking https://gerrit.wikimedia.org/r/#/c/307501/ on beta cluster for T143889
12:51 hashar: disabling https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/ to cherry pick a revert patch

2016-08-29

07:56 hashar: hard rebooting integration-slave-trusty-1012 via horizon and restarting puppet manually
07:50 hashar: integration-slave-trusty-1013 puppet.conf certname was set to 'undef' breaking puppet

2016-08-27

20:51 hashar: integration: tweak sudo policy for jenkins-deploy running cowbuilder: env_keep+=DEB_BUILD_OPTIONS
20:24 hashar: Manually installing jenkins-debian-glue 0.17.0 on integration-slave-jessie-1004 and integration-slave-jessie-1005 ( T142891 ) . That is to support PBUILDER_USENETWORK T141114
20:05 hashar: Jenkins added global env variable BUILD_TIMEOUT set to 30 for T144094

2016-08-26

22:29 legoktm: deploying https://gerrit.wikimedia.org/r/307025
08:15 Amir1: restart uwsgi-ores and celery-ores-worker in deployment-sca03 (T143567)
08:11 hashar: beta-scap-eqiad job is back in operation. Was blocked on logstash not being reachable. T143982
08:10 hashar: deployment-logstash2 is back after a hard reboot. T143982
08:07 hashar: rebooting deployment-logstash02 via Horizon. Kernel hang apparently T143982
08:00 hashar: beta-scap-eqiad failing investigating
07:54 Amir1: cherry-picked 306839/1 into deployment-puppetmaster
00:28 twentyafterfour: restarted puppetmaster service on deployment-puppetmaster

2016-08-25

23:15 Amir1: cherry-picked 306839/1 into puppetmaster
20:10 hashar: Delete integration-slave-trusty-1023 with label AndroidEmulator. The Android job has been migrated to a new Jessie based instance via T138506
19:05 hashar: hard rebooting integration-raita via Horizon
16:04 hashar: fixing puppet.conf on integration-slave-trusty-1013 it mysteriously considered itself as the puppetmaster
16:02 hashar: integration restarted puppetmaster service
08:28 hashar: beta update database fixed
08:28 hashar: beta cluster update database failed due to: "Your composer.lock file is up to date with current dependencies!" Probably a race condition with ongoing scap.

2016-08-24

15:14 halfak: deploying ores d00171
09:50 hashar: deployment-redis02 fixed AOF file /srv/redis/deployment-redis02-6379.aof and restarted the redis instance should fix T143655 and might help T142600
09:43 hashar: T143655 stopping redis 6379 on deployment-redis02 : initctl stop redis-instance-tcp_6379
09:38 hashar: deployment-redis02 initctl stop redis-instance-tcp_6379 && initctl start redis-instance-tcp_6379 | That did not fix it magically though T143655

2016-08-23

18:21 legoktm: deploying https://gerrit.wikimedia.org/r/306257
16:38 bd808: Fixed ops/puppet sync by removing stale cherry-pick of https://gerrit.wikimedia.org/r/#/c/305996/
08:22 hashar: running puppet on integration-slave-trusty-1014
08:18 hashar: reboot integration-slave-trusty-1014
08:16 hashar: disabled/enabled Jenkins Gearman client to remove deadlock with Throttle plugin

2016-08-22

23:40 legoktm: updating slave_scripts on all slaves

2016-08-18

22:03 bd808: deployment-fluorine02: Hack 'datasets:x:10003:997::/home/datasets:/bin/bash' into /etc/passwd for T117028
20:30 MaxSem: Restarted hhvm on appservers for wikidiff2 upgrades
19:03 MaxSem: Upgrading hhvm-wikidiff2 in beta cluster
16:53 legoktm: deploying https://gerrit.wikimedia.org/r/#/c/305532/

2016-08-17

22:28 legoktm: deploying https://gerrit.wikimedia.org/r/305408
21:33 cscott: updated OCG to version e3e0fd015ad8fdbf9da1838c830fe4b075c59a29
21:28 bd808: restarted salt-minion on deployment-pdf02
21:26 bd808: restarted salt-minion on deployment-pdf01
21:15 cscott: starting OCG deploy to beta
14:10 gehel: upgrading elasticsearch to 2.3.4 on deployment-logstash2.deployment-prep.eqiad.wmflabs
13:28 gehel: upgrading elasticsearch to 2.3.4 on deployment-elastic*.deployment-prep + JVM upgrade

2016-08-16

23:10 thcipriani: max_servers at 6, seeing 6 allocated instances, still seeing 403 already used 10 of 10 instances :((
22:37 thcipriani: restarting nodepool, bumping max_servers to match up with what openstack seems willing to allocate (6)
09:06 Amir1: removing ores-related-cherry-picked commits from deployment-puppetmaster

2016-08-15

21:30 thcipriani: update scap on beta to 3.2.3-1 bugfix release
02:30 bd808: Forced a zuul restart -- https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Restart
02:23 bd808: Lots and lots of "AttributeError: 'NoneType' object has no attribute 'name'" errors in /var/log/zuul/zuul.log
02:21 bd808: nodepool delete 301068
02:20 bd808: nodepool delete 301291
02:20 bd808: nodepool delete 301282
02:19 bd808: nodepool delete 301144
02:11 bd808: nodepool delete 299641
02:11 bd808: nodepool delete 278848
02:08 bd808: Aug 15 02:07:48 labnodepool1001 nodepoold[24796]: Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)
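
The 02:08 entry quotes the quota error that nodepoold kept hitting. A small Python sketch, assuming the message always has the shape quoted in that entry, that pulls the figures out of such a line so the remaining headroom can be checked. The function names are illustrative and not part of Nodepool.

```python
import re

# Pattern modelled on the message quoted in the log entry above.
QUOTA_ERROR = re.compile(
    r"Quota exceeded for (?P<resource>\w+): "
    r"Requested (?P<requested>\d+), "
    r"but already used (?P<used>\d+) of (?P<limit>\d+) \w+"
)


def parse_quota_error(line):
    """Return the quota figures from a log line, or None if it has none."""
    match = QUOTA_ERROR.search(line)
    if match is None:
        return None
    return {
        "resource": match.group("resource"),
        "requested": int(match.group("requested")),
        "used": int(match.group("used")),
        "limit": int(match.group("limit")),
    }


def has_headroom(quota):
    """True when the request would still fit inside the quota."""
    return quota["used"] + quota["requested"] <= quota["limit"]
```

For the line logged at 02:08 this reports 10 of 10 instances used, so a request for one more does not fit, which matches the deletions logged in the following minutes.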

2016-08-13

23:16 Amir1: cherry-picking 304678/1 into the puppetmaster
00:08 legoktm: deploying https://gerrit.wikimedia.org/r/304588
00:06 legoktm: deploying https://gerrit.wikimedia.org/r/304068

2016-08-12

23:57 legoktm: deploying https://gerrit.wikimedia.org/r/304587, no-op
18:19 Amir1: deploying 2ef24f2 to ores-beta in sca03

2016-08-10

23:56 legoktm: deploying https://gerrit.wikimedia.org/r/304149
23:47 thcipriani: stopping nodepool to clean up
23:41 legoktm: deploying https://gerrit.wikimedia.org/r/304131
21:59 thcipriani: restarted nodepool, no trusty instances were being used by jobs
01:58 legoktm: deploying https://gerrit.wikimedia.org/r/303218

2016-08-09

23:21 Amir1: ladsgroup@deployment-sca03:~$ sudo service celery-ores-worker restart
15:24 thcipriani: due to https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Jenkins_execution_lock
15:20 thcipriani: beta site updates stuck for 15 hours :(
02:17 legoktm: deploying https://gerrit.wikimedia.org/r/303741
02:16 legoktm: manually updated slave-scripts on all slaves via `fab deploy_slave_scripts`
00:56 legoktm: deploying https://gerrit.wikimedia.org/r/303726

2016-08-08

23:33 Tim: deleted instance deployment-depurate01
16:19 bd808: Manually cleaned up root@logstash02 cronjobs related to logstash03
14:39 Amir1: deploying d00159c for ores in sca03
10:14 Amir1: deploying 616707c into sca03 (for ores)

2016-08-07

12:01 hashar: Nodepool: can't spawn instances due to: Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)
12:01 hashar: nodepool: deleted servers stuck in "used" states for roughly 4 hours (using: nodepool list , then nodepool delete <id>)
11:54 hashar: Nodepool: can't spawn instances due to: Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)
11:54 hashar: nodepool: deleted servers stuck in "used" states for roughly 4 hours (using: nodepool list , then nodepool delete <id>)
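
The cleanup above (run `nodepool list`, then `nodepool delete <id>` for servers stuck in the "used" state for roughly 4 hours) can be sketched in Python. This is only an illustration: the `(node_id, state, age_hours)` tuples are a hypothetical hand-transcribed form, not the actual output format of `nodepool list`, and the sketch only builds the commands rather than running them.

```python
def stuck_node_commands(nodes, state="used", min_age_hours=4.0):
    """Build `nodepool delete <id>` commands for nodes stuck in one state.

    `nodes` is a hypothetical list of (node_id, state, age_hours) tuples,
    for example copied by hand from `nodepool list`.
    """
    return [
        "nodepool delete {}".format(node_id)
        for node_id, node_state, age_hours in nodes
        if node_state == state and age_hours >= min_age_hours
    ]


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [(101, "used", 4.2), (102, "ready", 6.0), (103, "used", 0.5)]
    for command in stuck_node_commands(sample):
        print(command)
```

Keeping the age threshold as a parameter makes it easy to review the list before deleting anything, since a node that has been "used" for a few minutes is probably just running a job.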

2016-08-06

12:31 Amir1: restarting uwsgi-ores and celery-ores-worker in deployment-sca03
12:28 Amir1: cherry-picked 303356/1 into the puppetmaster
12:00 Amir1: restarting uwsgi-ores and celery-ores-worker in deployment-sca03

2016-08-05

17:54 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/299825/3 for testing
17:50 bd808: Removed stale cherry-picks for https://gerrit.wikimedia.org/r/#/c/302303/ and https://gerrit.wikimedia.org/r/#/c/300458/ that were blocking git rebase
00:41 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/303113
00:31 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/300068

2016-08-04

20:07 marxarelli: Running jenkins-jobs update config/ 'selenium-*' to deploy https://gerrit.wikimedia.org/r/#/c/302775/
17:03 legoktm: jstart -N qamorebots /usr/lib/adminbot/adminlogbot.py --config ./confs/qa-logbot.py

2016-08-01

20:28 thcipriani: restarting deployment-ms-be01, not responding to ssh, mw-fe01 requests timing out
08:28 Amir1: deploying fedd675 to ores in sca03

2016-07-29

23:27 bd808: Rebooting deployment-logstash2; Console showed hung task timeouts (P3606)
15:55 hasharAway: pooled Jenkins slave integration-slave-jessie-1003 [10.68.21.145]
14:02 hashar: deployment-prep / beta : added addshore to the project
13:24 hashar: created integration-slave-jessie-1003 m1.medium to help processing debian-glue jobs
13:01 hashar: Upgrading Zuul on jessie slaves using https://people.wikimedia.org/~hashar/debs/zuul_2.1.0-391-gbc58ea3-jessie/zuul_2.1.0-391-gbc58ea3-wmf2jessie1_amd64.deb
12:53 hashar: Upgrading Zuul on precise slaves using https://people.wikimedia.org/~hashar/debs/zuul_2.1.0-391-gbc58ea3/zuul_2.1.0-391-gbc58ea3-wmf2precise1_amd64.deb
09:38 hashar: Upgrading Zuul to get rid of a forced sleep(300) whenever a patch is merged T93812. zuul_2.1.0-391-gbc58ea3-wmf2precise1

2016-07-28

21:46 hashar_: integration: change sudo policy for jenkins-deploy to help on T141538 : env_keep+=WORKSPACE
12:18 hashar: installed 2.1.0-391-gbc58ea3-wmf1jessie1 on zuul-dev-jessie.integration.eqiad.wmflabs T140894
12:18 hashar: installed 2.1.0-391-gbc58ea3-wmf1jessie1 on zuul-dev-jessie.integration.eqiad.wmflabs
09:46 hashar: Nodepool: Image ci-trusty-wikimedia-1469698821 in wmflabs-eqiad is ready
09:35 hashar: Regenerated Nodepool image for Trusty. The snapshot failed while upgrading grub-pc for some reason. Noticed with thcipriani yesterday

2016-07-27

16:13 hashar: salt -v '*slave-trusty*' cmd.run 'service mysql start' ( was missing on integration-slave-trusty-1011.integration.eqiad.wmflabs )
14:03 hashar: upgraded zuul on gallium via dpkg -i /root/zuul_2.1.0-391-gbc58ea3-wmf1precise1_amd64.deb (revert is zuul_2.1.0-151-g30a433b-wmf4precise1_amd64.deb )
12:43 hashar: restarted Jenkins for some trivial plugins updates
12:35 hashar: hard rebooting integration-slave-trusty-1011 from Horizon. ssh lost, no log in Horizon.
09:46 hashar: manually triggered debian-glue on all operations/debs repo that had no jenkins-bot vote. Via zuul enqueue on gallium and list fetched from "gerrit query --current-patch-set 'is:open NOT label:verified=2,jenkins-bot project:^operations/debs/.*'|egrep '(ref|project):'"
06:21 Tim: created instance deployment-depurate01 for testing of role::html5depurate
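
The 09:46 entry filters `gerrit query` output down to its `project:` and `ref:` lines before enqueuing each change in Zuul. A rough Python sketch of the pairing step, assuming each change lists its `project:` line before its `ref:` line, as in Gerrit's text output. The sample project name and change number are made up, and the Zuul invocation itself is left out because the log does not record its arguments.

```python
def pair_projects_with_refs(lines):
    """Pair each 'ref:' line with the most recent 'project:' line."""
    pairs = []
    project = None
    for raw in lines:
        key, _, value = raw.strip().partition(":")
        value = value.strip()
        if key == "project":
            project = value
        elif key == "ref" and project is not None:
            pairs.append((project, value))
    return pairs


def change_and_patchset(ref):
    """Turn refs/changes/NN/CHANGE/PATCHSET into 'CHANGE,PATCHSET'."""
    parts = ref.split("/")
    return "{},{}".format(parts[3], parts[4])


if __name__ == "__main__":
    # Hypothetical egrep output, for illustration only.
    sample = [
        "  project: operations/debs/example",
        "    ref: refs/changes/45/12345/2",
    ]
    for project, ref in pair_projects_with_refs(sample):
        print(project, change_and_patchset(ref))
```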

2016-07-26

20:13 hashar: Zuul deployed https://gerrit.wikimedia.org/r/301093 which adds 'debian-glue' job on all of operations/debs/ repos
18:10 ostriches: zuul: reloading to pick up config change
12:49 godog: cherry-pick https://gerrit.wikimedia.org/r/#/c/300827/ on deployment-puppetmaster
11:59 legoktm: also pulled in I73f01f87b06b995bdd855628006225879a17fee5
11:59 legoktm: deploying https://gerrit.wikimedia.org/r/301109
11:37 hashar: rebased integration puppetmaster git repo
11:31 hashar: enable puppet agent on integration-puppetmaster . Had it disabled while hacking on https://gerrit.wikimedia.org/r/#/c/300830/
08:42 hashar: T141269 On integration-slave-trusty-1018 , deleting workspace that has a corrupt git: rm -fR /mnt/jenkins-workspace/workspace/mediawiki-extensions-hhvm*
01:08 Amir1: deployed ores a291da1 in sca03, ores-beta.wmflabs.org works as expected

2016-07-25

22:45 legoktm: restarting zuul due to depends-on lockup
14:24 godog: bounce puppetmaster on deployment-puppetmaster
13:17 godog: cherry-pick https://gerrit.wikimedia.org/r/#/c/300827/ on deployment-puppetmaster

2016-07-23

20:06 bd808: Cleanup jobrunner01 logs via -- sudo logrotate --force /etc/logrotate.d/mediawiki_jobrunner
20:03 bd808: Deleted jobqueues in redis with no matching wikis: ptwikibooks, labswiki
19:20 bd808: jobrunner01 spamming /var/log/mediawiki with attempts to process jobs for wiki=labswiki

2016-07-22

20:26 hashar: T141114 upgraded jenkins-debian-glue from v0.13.0 to v0.17.0 on integration-slave-jessie-1001 and integration-slave-jessie-1002
19:07 thcipriani: beta-cluster has successfully used a canary for mediawiki deployments
16:53 thcipriani: bumping scap to v.3.2.1 on deployment-tin to test canary deploys, again
16:46 thcipriani: rolling back scap version to v.3.2.0
16:38 thcipriani: bumping scap to v.3.2.1 on deployment-tin to test canary deploys
13:02 hashar: zuul rebased patch queue on tip of upstream branch and force pushed branch. c3d2810...4ddad4e HEAD -> patch-queue/debian/precise-wikimedia (forced update)
10:32 hashar: Jenkins restarted and it pooled both integration-slave-jessie-1002 and integration-slave-trusty-1018
10:23 hashar: Jenkins has some random deadlock. Will probably reboot it
10:17 hashar: Jenkins can't ssh / add slaves integration-slave-jessie-1002 or integration-slave-trusty-1018 . Apparently due to some Jenkins deadlock in the ssh slave plugin :-/ Lame way to solve it: restart Jenkins
10:10 hashar: rebooting integration-slave-jessie-1002 and integration-slave-trusty-1018 . Hung somehow
10:06 hashar: T141083 salt -v '*slave-trusty*' cmd.run 'service mysql start'
09:55 hashar: integration-slave-trusty-1001 service mysql start

2016-07-21

16:11 hashar: Updated our JJB fork cherry picking f74501e781f by madhuvishy. Was made to support the maven release plugin. Branch bump is 10f2bcd..6fcaf39
16:04 hashar: integration/zuul.git : Updated upstream branch: bc58ea34125f11eb353abc3e5b96ac1efad06141 finally caught up with upstream \O/
15:13 hashar: integration/zuul.git : Updated upstream branch: 06770a85fcff810fc3e1673120710100fc7b0601:upstream
14:03 hashar: integration/zuul.git bumping upstream branch: git push d34e0b4:upstream
03:18 greg-g: had to do https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update twice, seems to be back
00:13 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/299825/ to deployment-puppetmaster so wdqs nginx log parsing can be tested

2016-07-20

13:55 hashar: beta: switching job beta-scap-eqiad to use 'scap sync' per https://gerrit.wikimedia.org/r/#/c/287951/ (poke thcipriani )
12:47 hashar: integration: enabled unattended upgrade on all instances by adding contint::packages::apt to https://wikitech.wikimedia.org/wiki/Hiera:Integration
10:28 hashar: beta dropped salt-key on deployment-salt02 for the three instances: deployment-upload.deployment-prep.eqiad.wmflabs , deployment-logstash3.deployment-prep.eqiad.wmflabs and deployment-ores-web.deployment-prep.eqiad.wmflabs
10:26 hashar: beta: rebased puppetmaster git repo. "Parsoid: Move to service::node" has weird conflict https://gerrit.wikimedia.org/r/#/c/298436/
10:15 hashar: beta: removing puppet cherry pick of https://gerrit.wikimedia.org/r/#/c/258979/ "mediawiki: add conftool-specifc credentials and scripts" abandoned/superseded and caused a conflict
08:17 hashar: deployment-fluorine : deleting a puppet lock file /var/lib/puppet/state/agent_catalog_run.lock (created at 2016-07-18 19:58:46 UTC)
01:53 legoktm: deploying https://gerrit.wikimedia.org/r/299930

2016-07-18

20:56 thcipriani: Deleted deployment-fluorine:/srv/mw-log/archive/*-201605* freed 30 GB
15:00 hashar: Upgraded Zuul on the Precise slaves to zuul_2.1.0-151-g30a433b-wmf4precise1
12:10 hashar: (restarted qa-morebots)
12:10 hashar: Enabling puppet again on integration-slave-precise-1002 , removing Zuul-server config and adding the slave back in Jenkins pool

2016-07-16

23:19 paladox: testing morebots

2016-07-15

08:34 hashar: Unpooling integration-slave-precise-1002 will use it as a zuul-server test instance temporarily

2016-07-14

18:54 ebernhardson: deployment-prep manually edited elasticsearch.yml on deployment-elastic05 and restarted to get it listening on eth0. Still looking into why puppet wrote out wrong config file
09:05 Amir1: rebooting deployment-ores-redis
08:29 Amir1: deploying 0e9555f to ores-beta (sca03)

2016-07-13

16:05 urandom: Installing Cassandra 2.2.6-wmf1 on deployment-restbase0[1-2].deployment-prep.eqiad.wmflabs : T126629
13:58 hashar: T137525 reverted Zuul back to zuul_2.1.0-95-g66c8e52-wmf1precise1_amd64.deb . It could not connect to Gerrit reliably
13:46 hashar: T137525 Stopped zuul that ran in a terminal (with -d). Started it with the init script.
11:37 hashar: apt-get upgrade on deployment-mediawiki02
08:33 hashar: removing deployment-parsoid05 from the Jenkins slaves T140218

2016-07-12

20:29 hashar: integration: force running unattended upgrade on all instances: salt --batch 4 -v '*' cmd.run 'unattended-upgrade' . That upgrades diamond and hhvm among others. imagemagick-common has a prompt though
20:22 hashar: CI force running puppet on all instances: salt --batch 5 -v '*' puppet.run
20:04 hashar: Maybe fix unattended upgrade on the CI slaves via https://gerrit.wikimedia.org/r/298568
16:43 Amir1: deploying f472f65 to ores-beta
10:11 hashar: Github created repos operations-debs-contenttranslation-apertium-mk-en and operations-docker-images-toollabs-images for Gerrit replication

2016-07-11

14:24 hashar: Removing ZeroMQ config from the Jenkins jobs. It is now enabled globally. T139923
10:16 hashar: T136188: on Trusty slaves, upgrading Chromium from v49 to v51: salt -v '*slave-trusty-*' cmd.run 'apt-get -y install chromium-browser chromium-chromedriver chromium-codecs-ffmpeg-extra'
10:13 hashar: T136188: salt -v '*slave-trusty*' cmd.run 'rm /etc/apt/preferences.d/chromium-*'
10:09 hashar: Unpinning Chromium v49 from the Trusty slaves and upgrading to v51 for T136188
09:34 zeljkof: Enabled ZMQ Event Publisher on all Jobs in Jenkins

2016-07-09

18:57 legoktm: deploying https://gerrit.wikimedia.org/r/297731 and https://gerrit.wikimedia.org/r/298142
14:07 bd808: Testing logstash change https://gerrit.wikimedia.org/r/#/c/298115/ via cherry-pick

2016-07-08

16:08 hashar: scandium: git -C /srv/ssd/zuul/git/mediawiki/services/graphoid remote set-head origin --auto
16:06 hashar: scandium: git -C /srv/ssd/zuul/git/mediawiki/services/graphoid init && git -C /srv/ssd/zuul/git/mediawiki/services/graphoid remote add origin ssh://jenkins-bot@ytterbium.wikimedia.org:29418/mediawiki/services/graphoid
14:59 hashar: nodepool: rebuild Trusty image from scratch Image ci-trusty-wikimedia-1467989709 in wmflabs-eqiad is ready
12:35 hashar: beta: find /data/project/upload7/*/*/thumb -type f -atime +30 -delete
10:31 hashar: beta: mass delete http://commons.wikimedia.beta.wmflabs.org/wiki/Category:GWToolset_Batch_Upload files T64835
10:26 hashar: beta: mass delete http://commons.wikimedia.beta.wmflabs.org/wiki/Category:GWToolset_Batch_Upload files

2016-07-07

21:41 MaxSem: Chowned php-master/vendor back to jenkins-deploy
13:10 hashar: deleting integration-slave-trusty-1024 and integration-slave-trusty-1025 to free up some RAM. We have enough permanent Trusty slaves. T139535
02:43 MaxSem: started redis-server on deployment-stream
01:14 bd808: Restarted logstash on deployment-logstash2
01:13 MaxSem: Leaving my hacks for the night to collect data, if needed revert with cd /srv/mediawiki-staging/php-master/vendor && sudo git reset --hard HEAD && sudo chown -hR jenkins-deploy:wikidev .
00:50 bd808: Rebooting deployment-logstash3.eqiad.wmflabs; console full of hung process messages from kernel
00:27 MaxSem: Initialized ORES on all wikis where it's enabled, was causing job failures
00:13 MaxSem: Debugging a fatal in betalabs, might cause syncs to fail

2016-07-06

20:30 hashar: beta: restarted mysql on both db1 and db2 so it takes into account the --syslog setting T119370
20:08 hashar: beta: on db1 and db2 move the MariaDB 'syslog' setting under [mysqld_safe] section. Cherry picked https://gerrit.wikimedia.org/r/#/c/296713/3 and reloaded mysql on both instances. T119370
14:54 hashar: Image ci-jessie-wikimedia-1467816381 in wmflabs-eqiad is ready T133779
14:47 hashar_: attempting to refresh ci-jessie-wikimedia image to get librdkafka-dev included for T133779

2016-07-05

21:54 hasharAway: CI has drained the gate-and-submit queue
21:37 hasharAway: Nodepool: nodepool delete a few instances that would never spawn / have been stuck for ~ 40 minutes

2016-07-04

18:58 hashar: Upgrading arcanist on permanent CI slaves since xhpast was broken T137770
12:50 yuvipanda: migrating deployment-tin to labvirt1011

2016-07-03

13:10 paladox: phabricator Update phab-01 and phab-05 (phab-02) and phab-03 to fix a security bug in phabricator (Did the update last night but forgot to log it)
12:04 jzerebecki: reloading zuul for 7e6a2e2..13ea50f

2016-07-02

13:38 jzerebecki: reloading zuul for 15127b2..7e6a2e2

2016-06-30

10:31 hashar: Deleting integration-slave-trusty-1015 . Can not bring up mysql T138074 and the ssh slave connection would not hold anyway. Must be broken somehow
10:04 hashar: Attempting to refresh Nodepool image for Jessie ( ci-jessie-wikimedia ). Been stalled for 284 hours (12 days)
09:36 hashar: Trusty is missing the package arcanist ... :(
09:35 hashar: Attempting to refresh Nodepool image for Trusty ( ci-trusty-wikimedia ). Been stalled for 283 hours (12 days)

2016-06-28

21:33 halfak: deploying ores beec291
21:15 halfak: deploying ores 6979a98

2016-06-27

22:32 eberhardson: deployment-prep deployed gerrit.wikimedia.org/r/296279 to puppetmaster to test kibana4 role
19:41 bd808: Rebooting deployment-logstash3.eqiad.wmflabs via wikitech. Console log full of blocked kworker messages, ssh non-responsive, and blocking logstash records from being recorded.
18:20 thcipriani: deployment-puppetmaster.deployment-prep:/var/lib/git/labs/private modules/secret/secrets/keyholder keys conflicts resolved
18:09 bd808: Git repo at deployment-puppetmaster.deployment-prep:/var/lib/git/labs/private is behind upstream due to multiple modules/secret/secrets/keyholder local files that would be overwritten by upstream changes.

2016-06-24

15:04 hashar: switch apps-android-wikimedia-* jobs to Jessie T138506
14:07 James_F: Killed https://integration.wikimedia.org/ci/job/pywikibot-core-tox-nose-jessie/556/console (stuck for 90 minutes)
09:54 hashar: T138506 Adding a JDK installation "Debian - OpenJdk 8" in Jenkins global configuration with JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

2016-06-23

13:58 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/295691
12:13 hashar: Deleting integration-saltmaster and recreating it with Jessie T136410
10:14 hashar: T137807 Upgrading Jenkins TAP Plugin
08:55 hashar: integration: rebased puppet master by dropping a conflicting/obsolete patch
08:28 hashar: fixing puppet cert on deployment-cache-text04

2016-06-17

10:35 jzerebecki: offlined integration-slave-trusty-1015 T138074
10:06 hashar: Refreshed Nodepool Trusty image
10:02 hashar: Refreshed Nodepool Jessie image

2016-06-14

14:22 hashar: T136971 on tin MediaWiki 1.28.0-wmf.6, from 1.28.0-wmf.6, successfully checked out. Applying security patches
11:21 hashar: T137797 Created Gerrit repository operations/debs/geckodriver to package https://github.com/mozilla/geckodriver

2016-06-13

21:11 hashar: https://integration.wikimedia.org/ci/computer/integration-slave-trusty-1015/ put offline. Jenkins can't ssh / pool it for some reason
20:07 hashar: beta: update.php / database update finally pass!
19:55 hashar: T137615 deployment-db2, **eswiki** > CREATE INDEX echo_notification_event ON echo_notification (notification_event);
19:22 hashar: T137615 deployment-db2, enwiki > CREATE INDEX echo_notification_event ON echo_notification (notification_event);
10:37 hashar: Restarted puppetmaster on integration-puppetmaster (memory leak / can not fork: no memory)
10:35 hashar: T137561 salt -v '*trusty*' cmd.run "cd /root/ && dpkg -i firefox_46.0.1+build1-0ubuntu0.14.04.3_amd64.deb"
10:23 hashar: Hard reboot integration-slave-trusty-1015
08:30 hashar: Beta: `mwscript extensions/Echo/maintenance/removeInvalidTargetPage.php --wiki=enwiki` for T137615

2016-06-10

15:49 jzerebecki: reloading zuul for 8c048fb..272d1ec
15:29 jzerebecki: T137561 integration-puppetmaster:/var/lib/git/operations/puppet# git reset --hard 1e1ff12b13b73b5c5e2015a72f51561f10b305d0
15:19 jzerebecki: T137561 integration-saltmaster:~# salt -v '*trusty*' cmd.run "cd /root/ && dpkg -i firefox_46.0.1+build1-0ubuntu0.14.04.3_amd64.deb"
15:18 jzerebecki: T137561 integration-saltmaster:~# salt -v '*trusty*' cmd.run "cd /root/ && wget 'https://ubuntu.wikimedia.org/ubuntu/pool/main/f/firefox/firefox_46.0.1%2bbuild1-0ubuntu0.14.04.3_amd64.deb'"
15:15 jzerebecki: T137561 integration-puppetmaster:/var/lib/git/operations/puppet# git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/39/293739/1 && git cherry-pick FETCH_HEAD

2016-06-09

18:49 hashar: restarting nutcracker on deployment-mediawiki02
16:53 hashar: rebuild Nodepool trusty image ci-trusty-wikimedia-1465490962
16:37 hashar: Manually deleting old zuul references on scandium.eqiad.wmnet . Running in a screen
16:32 hashar: rebuild Nodepool jessie image ci-jessie-wikimedia-1465489579
16:03 hashar: Restarting Nodepool

2016-06-08

02:56 legoktm: / on gallium is read-only
02:47 legoktm: disabling/enabling gearman in jenkins because everything is stuck

2016-06-07

19:28 hashar: Nodepool has trouble spawning instances, probably due to ongoing (?) labs maintenance
14:56 hashar: Restarting Jenkins to upgrade Rebuilder plugin with https://github.com/jenkinsci/rebuild-plugin/pull/34 (sort out parameters not being reinjected)
09:02 hashar: Upgrading Jenkins IRC plugin 2.25..2.27 and instant messaging plugin 1.34..1.35 . The former should fix a deadlock when shutting down Jenkins | T96183

2016-06-06

19:26 hasharAway: Regenerating Nodepool snapshots for Trusty and Jessie
13:04 hashar: Migrated all qunit jobs to Nodepool T136301 has the related Gerrit changes
10:05 hashar: migrating mediawiki-core-qunit job to Nodepool instances https://gerrit.wikimedia.org/r/#/c/291322/ T136301

2016-06-04

00:09 Krinkle: krinkle@integration-slave-trusty-1017:~$ sudo rm -rf /mnt/jenkins-workspace/workspace/mediawiki-extensions-hhvm/src/extensions/Babel (T86730)

2016-06-03

19:18 hashar: Image ci-jessie-wikimedia-1464981111 in wmflabs-eqiad is ready Zend 5.x for qunit | T136301
15:17 hashar: refreshed Nodepool Trusty image due to some imagemagick upgrade issue. Image ci-trusty-wikimedia-1464966671 in wmflabs-eqiad is ready
10:40 hashar: scandium (zuul merger): rm -fR /srv/ssd/zuul/git/mediawiki/extensions/Collection T136930

2016-06-02

12:10 hashar: Upgraded Zuul: upstream code 66c8e52..30a433b, package is 2.1.0-151-g30a433b-wmf1precise1

2016-06-01

17:49 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/292186
16:45 tgr: enabling AuthManager on beta cluster
15:20 legoktm: deploying https://gerrit.wikimedia.org/r/292153
14:44 twentyafterfour: jenkins restart completed
14:36 twentyafterfour: restarting jenkins to install "single use slave" plugin (jenkins will restart when all builds are finished)
13:49 hashar: Beta : clearing temporary files under /data/project/upload7 (mainly wikimedia/commons/temp )
10:29 hashar: Upgraded Linux kernel on deployment-salt02 T136411
10:14 hashar: beta: salt-key -d deployment-salt.deployment-prep.eqiad.wmflabs T136411
09:16 hashar: Enabling puppet again on Trusty slaves. Chromium is now properly pinned to version 49 ( https://gerrit.wikimedia.org/r/#/c/291116/3 | T136188 )
08:55 hashar: integration slaves : salt -v '*' pkg.upgrade

2016-05-31

20:24 bd808: Reloading zuul to pick up I58f878f3fd19dfa21a46a52464575cb06aacbb22

2016-05-30

18:39 hashar: Upgraded our Jenkins Job Builder fork to 1.5.0 + a couple of cherry picks: cd63874...10f2bcd
12:53 hashar: Upgrading Zuul 1cc37f7..66c8e52 T128569
08:04 ori: zuul is back up but jobs which were enqueued are gone
07:50 ori: restarting jenkins on gallium, too
07:49 ori: restarted zuul-merger service on gallium
07:44 ori: Disconnecting and then reconnecting Gearman from Jenkins did not appear to do anything; going to depool / repool nodes.
07:42 ori: Temporarily disconnecting Gearman from Jenkins, per <https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues>

2016-05-28

04:43 ori: depooling integration-slave-trusty-1015 to profile phpunit runs

2016-05-27

19:29 hasharAway: Refreshed Nodepool images
18:13 thcipriani: restarting zuul for deadlock
18:00 thcipriani: Reloading Zuul to deploy I0c3aeacf92d430ad1272f5f00e7fb7182b8a05bf
02:55 bd808: Deleted deployment-fluorine:/srv/mw-log/archive/*-20160[34]* logs; freed 26G

2016-05-26

22:23 hashar: salt -v '*trusty*' cmd.run 'puppet agent --disable "Chromium needs to be v49. See T136188"'
21:47 hashar: integration-slave-trusty-1015 still on Chromium 50 .. T136188
21:42 hashar: downgrading chromium-browser on integration-slave-trusty-1015 T136188
09:24 jzerebecki: reloading zuul for d38ad0a..6798539
07:48 gehel: deployment-prep upgrading elasticsearch to 2.3.3 and restarting (T133124)
07:36 dcausse: deployment-prep elastic: updating cirrussearch warmers (T133124)
07:31 gehel: deployment-prep deploying new elasticsearch plugins (T133124)

2016-05-25

22:38 Amir1: running puppet agent manually on sca01
16:26 hashar: 2016-05-25 16:24:35,491 INFO nodepool.image.build.wmflabs-eqiad.ci-trusty-wikimedia: Notice: /Stage[main]/Main/Package[ruby-jsduck]/ensure: ensure changed 'purged' to 'present' T109005
15:07 hashar: g++ added to Jessie and Trusty Nodepool instances | T119143
14:12 hashar: Regenerating Nodepool snapshot to include g++ which is required by some NodeJS native modules T119143
10:58 hashar: Updating Nodepool ci-jessie-wikimedia snapshot image to get netpbm package installed into it. T126992 https://gerrit.wikimedia.org/r/290651
09:30 hashar: Clearing git-sync-upstream script on integration-slave-trusty-1013 and integration-slave-trusty-1017. That is only supposed to be on the puppetmaster
09:15 hashar: Fixed resolv.conf on integration-slave-trusty-1013 and force running puppet to catch up with change since May 16 19:52
09:11 hashar: restarting puppetmaster on integration-puppetmaster ( memory leak / can not fork)

2016-05-24

07:03 mobrovac: rebooting deployment-tin, can't log in

2016-05-23

19:35 hashar: killed all mysqld process on Trusty CI slaves
15:49 thcipriani: beta code update not running, disconnect-reconnect dance resulted in: [05/23/16 15:48:39] [SSH] Authentication failed.
14:32 jzerebecki: offlined integration-slave-trusty-1004 because it can't connect to mysql T135997
13:32 hashar: Upgrading Jenkins git plugins and restarting Jenkins
11:01 hashar: Upgrading hhvm on Trusty slaves. Brings in hhvm compiled against libicu52 instead of libicu48
09:12 _joe_: deployment-prep: all hhvm hosts in beta upgraded to run on the newer libicu; now running updateCollation.php (T86096)
09:11 hashar: Image ci-jessie-wikimedia-1463994307 in wmflabs-eqiad is ready
09:01 hashar: Image ci-trusty-wikimedia-1463993508 in wmflabs-eqiad is ready
08:56 _joe_: deployment-prep: starting upgrade of HHVM to a version linked to libicu52, T86096
08:54 hashar: Regenerating Nodepool image manually. Broke over the week-end due to a hhvm/libicu transition. Should get pip 8.1.x now

2016-05-20

20:30 bd808: Killing https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/43608/ which has been running for 5 hours

2016-05-19

16:47 thcipriani: deployment-tin jenkins worker seems to be back online after some prodding
16:41 thcipriani: beta-code-update eqiad hung for past few hours
15:16 hashar: Restarted zuul-merger daemons on both gallium and scandium : file descriptors leaked
11:59 hashar: CI: salt -v '*' cmd.run 'pip install --upgrade pip==8.1.2'
11:54 hashar: Upgrading pip on CI slaves from 7.0.1 to 8.1.2 https://gerrit.wikimedia.org/r/#/c/289639/
10:15 hashar: puppet broken on deployment-tin : Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid parameter trusted_group on node deployment-tin.deployment-prep.eqiad.wmflabs

2016-05-18

13:16 Amir1: deploying a05e830 to ores nodes (sca01 and ores-web)
12:46 urandom: (re)cherry-picking c/284078 to deployment-prep
11:36 hashar: Restarted qa-morebots
11:36 hashar: Marked mediawiki/core/vendor repository as hidden in Gerrit. It got moved to mediawiki/vendor including the whole history. Settings page: https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/core/vendor

2016-05-13

14:39 thcipriani: remove shadow l10nupdate user from deployment-tin and mira in beta
10:20 hashar: Put integration-slave-trusty-1004 offline. Ssh/passwd is borked T135217
09:59 hashar: Deleting non nodepool mediawiki PHPUnit jobs for T135001 (mediawiki-phpunit-hhvm mediawiki-phpunit-parsertests-hhvm mediawiki-phpunit-parsertests-php55 mediawiki-phpunit-php55)
04:06 thcipriani|afk: changed ownership of mwdeploy public keys post shadow mwdeploy user removal is important
03:47 thcipriani|afk: ldap failure has created a shadow mwdeploy user on beta, deleted using vipw

2016-05-12

22:53 bd808: Started dead mysql on integration-slave-precise-1011

2016-05-11

21:05 hashar: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/288128 #T134946
20:26 hashar: rebooting integration-slave-trusty-1016 is back up
20:15 hashar: rebooting integration-slave-trusty-1016 unreachable somehow
16:43 hashar: Reduced number of executors on Trusty instances from 3 to 2. Memory gets exhausted, causing the tmpfs to drop files and thus MW jobs to fail randomly.
13:33 hashar: Added contint::packages::php to Nodepool images T119139
12:59 hashar: Dropping texlive and its dependencies from gallium.
12:52 hashar: deleted integration-dev
12:51 hashar: creating integration-dev instance to hopefully have Shinken clean itself
11:42 hashar: rebooting deployment-aqs01 via wikitech T134981
10:46 hashar: beta/ci puppetmaster : deleting old tags in /var/lib/git/operations/puppet and repacking the repos
08:49 hashar: Deleting instances deployment-memc02 and deployment-memc03 (Precise instances, migrated to Jessie) #T134974
08:43 hashar: Beta: switching memcached to new Jessie servers by cherry picking https://gerrit.wikimedia.org/r/#/c/288156/ and running puppet on mw app servers #T134974
08:20 hashar: Creating deployment-memc04 and deployment-memc05 to switch beta cluster memcached to Jessie. m1.medium with security policy "cache" T13497
01:44 matt_flaschen: Created Flow-specific External Store tables (blobs_flow1) on all wiki databases on Beta Cluster: T128417

2016-05-10

19:17 hashar: beta / CI purging old Linux kernels: salt -v '*' cmd.run 'dpkg -l|grep ^rc|awk "{ print \$2 }"|grep linux-image|xargs dpkg --purge'
17:34 cscott: updated OCG to version b0c57a1c6890e9fa1f2c3743fc14cb6a7f244fc3
16:44 bd808: Cleaned up 8.5G of pbuilder tmp output on integration-slave-jessie-1001 with `sudo find /mnt/pbuilder/build -maxdepth 1 -type d -mtime +1 -exec rm -r {} \+`
16:35 bd808: https://integration.wikimedia.org/ci/job/debian-glue failure on integration-slave-jessie-1001 due to /mnt being 100% full
14:20 hashar: deployment-puppetmaster mass cleaned packages/service/users etc T134881
13:54 moritzm: restarted zuul-merger on scandium for openssl update
13:52 moritzm: restarting zuul on gallium for openssl update
13:51 moritzm: restarted apache and zuul-merger on gallium for openssl update
13:48 hashar: deployment-puppetmaster : dropping role::ci::jenkins_access role::ci::slave::labs and role::ci::slave::labs::common T134881
13:46 hashar: Deleting Jenkins slave deployment-puppetmaster T134881
13:45 hashar: Change https://integration.wikimedia.org/ci/job/beta-build-deb/ job to use label selector "DebianGlue && DebianJessie" instead of "BetaDebianRepo" T134881
13:33 hashar: Migrating all debian glue jobs to Jessie permanent slaves T95545
13:30 hashar: Adding integration-slave-jessie-1002 in Jenkins. it is all puppet compliant
12:59 thcipriani|afk: triggering puppet run on scap targets in beta for https://gerrit.wikimedia.org/r/#/c/287918/ cherry pick
09:07 hashar: fixed puppet.conf on deployment-cache-text04

2016-05-09

20:58 hashar: Unbroke puppet on integration-raita.integration.eqiad.wmflabs . Puppet was blocked because role::ci::raita was no more. Fixed by rebasing https://gerrit.wikimedia.org/r/#/c/208024 T115330
20:13 hashar: beta: salt -v '*' cmd.run 'dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia' # T134808
20:06 hashar: CI, removing ganglia configuration entirely via: salt -v '*' cmd.run 'rm -fRv /etc/ganglia' # T134808
20:04 hashar: CI, removing ganglia configuration entirely via: salt -v '*' cmd.run 'dpkg --purge ganglia-monitor' # T134808
16:32 jzerebecki: reloading zuul for 3e2ab56..d663fd0
15:39 andrewbogott: migrating deployment-fluorine to labvirt1009
15:39 hashar: Adding label contintLabsSlave to integration-slave-jessie-1001 and integration-slave-jessie-1002
15:26 hashar: Creating integration-slave-jessie-1001 T95545

2016-05-06

19:45 urandom: Restart cassandra-metrics-collector on deployment-restbase0[1-2]
19:41 urandom: Rebasing 02ae1757 on deployment-puppetmaster : T126629

2016-05-05

22:09 MaxSem: Promoted Yurik and Jgirault to sysops on beta enwiki. Through shell because logging in is broken for me.

2016-05-04

21:28 cscott: deployed puppet FQDN domain patch for OCG: https://gerrit.wikimedia.org/r/286068 and restarted ocg on deployment-pdf0[12]
15:03 hashar: beta-scap: deployment-tin.deployment-prep.eqiad.wmflabs Name or service not known
15:03 hashar: beta-scap: deployment-tin.deployment-prep.eqiad.wmflabs
12:24 hashar: deleting Jenkins job mediawiki-core-phpcs , replaced by Nodepool version mediawiki-core-phpcs-trusty T133976
12:11 hashar: beta: restarted nginx on varnish caches ( systemctl restart nginx.service ) since they were not listening on port 443 #T134362
11:07 hashar: restarted CI puppetmaster (out of memory / memory leak)
10:57 hashar: CI: mass upgrading deb packages
10:53 hashar: beta: clearing out leftover apt conf that points to unreachable web proxy : salt -v '*' cmd.run "find /etc/apt -name '*-proxy' -delete"
10:48 hashar: Manually fixing nginx upgrade on deployment-cache-text04 and deployment-cache-upload04 see T134362 for details
09:27 hashar: deployment-cache-text04 systemctl stop varnish-frontend.service . To clear out all the stuck CLOSE_WAIT connections T134346
08:33 hashar: fixed puppet on deployment-cache-text04 (race condition generating puppet.conf )

2016-05-03

23:21 bd808: Changed "Maximum Number of Retries" for ssh agent launch in jenkins for deployment-tin from "0" to "10"
23:01 twentyafterfour: rebooting deployment-tin
23:00 bd808: Jenkins agent on deployment-tin not spawning; investigating
20:02 hashar: Restarting Jenkins
16:49 hashar: Notice: /Stage[main]/Contint::Packages::Python/Package[pypy]/ensure: ensure changed 'purged' to 'present' | T134235
16:46 hashar: Refreshing Nodepool Jessie image to have it include pypy | T134235 poke @jayvdb
14:49 mobrovac: deployment-tin rebooting it
14:25 hashar: beta salt -v '*' pkg.upgrade
14:19 hashar: beta: added unattended upgrade to Hiera::deployment-prep
13:30 hashar: Restarted nslcd on deployment-tin , pam was refusing authentication for some reason
13:29 hashar: beta: got rid of a leftover Wikidata/Wikibase patch that broke scap salt -v 'deployment-tin*' cmd.run 'sudo -u jenkins-deploy git -C /srv/mediawiki-staging/php-master/extensions/Wikidata/ checkout -- extensions/Wikibase/lib/maintenance/populateSitesTable.php'
13:23 hashar: deployment-tin force upgraded HHVM from 3.6 to 3.12
09:42 hashar: adding puppet class contint::slave_scripts to deployment-sca01 and deployment-sca02 . Ships multigit.sh T134239
09:31 hashar: Deleting CI slave deployment-cxserver03 , added deployment-sca01 and deployment-sca02 in Jenkins. T134239
09:28 hashar: deployment-sca01 removing puppet lock /var/lib/puppet/state/agent_catalog_run.lock and running puppet again
09:26 hashar: Applying puppet class role::ci::slave::labs::common on deployment-sca01 and deployment-sca02 (cxserver and parsoid being migrated T134239 )
03:33 kart_: Deleted deployment-cxserver03, replaced by deployment-sca0x

2016-05-02

21:27 cscott: updated OCG to version b775e612520f9cd4acaea42226bcf34df07439f7
21:26 hashar: Nodepool is acting just fine: Demand from gearman: ci-trusty-wikimedia: 457 | <AllocationRequest for 455.0 of ci-trusty-wikimedia>
21:25 hashar: restarted qa-morebots "2016-05-02 21:22:23,599 ERROR: Died in main event loop"
21:23 hashar: gallium: enqueued 488 jobs directly in Gearman. That is to test https://gerrit.wikimedia.org/r/#/c/286462/ ( mediawiki/extensions to hhvm/zend5.5 on Nodepool). Progress /home/hashar/gerrit-286462.log
20:14 hashar: MediaWiki phpunit jobs to run on Nodepool instances \O/
16:41 urandom: Forcing puppet run and restarting Cassandra on deployment-restbase0[1-2] : T126629
16:40 urandom: Cherry-picking https://gerrit.wikimedia.org/r/operations/puppet refs/changes/78/284078/12 to deployment-puppetmaster : T126629
16:24 urandom: Restart Cassandra on deployment-restbase0[1-2] : T126629
16:21 urandom: forcing puppet run on deployment-restbase0[1-2] : T126629
16:21 urandom: cherry-picking latest refs/changes/78/284078/11 onto deployment-puppetmaster : T126629
09:44 hashar: On zuul-merger instances (gallium / scandium), cleared out pywikibot/core working copy ( rm -fR /srv/ssd/zuul/git/pywikibot/core/ ) T134062

2016-04-30

18:31 Amir1: deploying d4f63a3 from github.com/wiki-ai/ores-wikimedia-config into targets in beta cluster via scap3

2016-04-29

16:37 jzerebecki: restarting zuul for 4e9d180..ebb191f
15:45 hashar: integration: deleting integration-trusty-1026 and cache-rsync . Maybe that will clear them up from Shinken
15:14 hashar: integration: created 'cache-rsync' and 'integration-trusty-1026' , attempting to have Shinken deprovision them

2016-04-28

22:03 urandom: deployment-restbase01 upgrade to 2.2.6 complete : T126629
21:56 urandom: Stopping Cassandra on deployment-restbase01, upgrading package to 2.2.6, and forcing puppet run : T126629
21:55 urandom: Snapshotting Cassandra tables on deployment-restbase01 (name = 1461880519833) : T126629
21:55 urandom: Snapshotting Cassandra tables on deployment-restbase01 : T126629
21:52 urandom: Forcing puppet run on deployment-restbase02 : T126629
21:51 urandom: Cherry picking operations/puppet refs/changes/78/284078/10 to puppetmaster : T126629
20:46 urandom: Starting Cassandra on deployment-restbase02 (now v2.2.6) : T126629
20:41 urandom: Re-enable puppet and force run on deployment-restbase02 : T126629
20:38 urandom: Halting Cassandra on deployment-restbase02, masking systemd unit, and upgrading package(s) to 2.2.6 : T126629
20:37 urandom: Snapshotting Cassandra tables on deployment-restbase02 (snapshot name = 1461875833996) : T126629
20:37 urandom: Snapshotting Cassandra tables on deployment-restbase02 : T126629
20:33 urandom: Cassandra on deployment-restbase01.deployment-prep started : T126629
20:25 urandom: Restarting Cassandra on deployment-restbase01.deployment-prep : T126629
20:14 urandom: Re-enable puppet on deployment-restbase01.deployment-prep, and force a run : T126629
20:12 urandom: cherry-picking https://gerrit.wikimedia.org/r/#/c/284078/ to deployment-puppetmaster : T126629
20:06 urandom: Disabling puppet on deployment-restbase0[1-2].deployment-prep : T126629
14:43 hashar: Rebuild Nodepool Jessie image. Comes with hhvm
12:52 hashar: Puppet is happy on deployment-changeprop
12:47 hashar: apt-get upgrade deployment-changeprop (outdated exim package)
12:42 hashar: Rebuild Nodepool Trusty instance to include the PHP wrapper script T126211

2016-04-27

23:57 thcipriani: nodepool instances running again after an openstack rabbitmq restart by andrewbogott
22:51 duploktm: also ran openstack server delete ci-jessie-wikimedia-85342
22:42 legoktm: nodepool delete 85342
22:41 matt_flaschen: Deployed https://gerrit.wikimedia.org/r/#/c/285765/ to enable External Store everywhere on Beta Cluster
22:38 legoktm: stop/started nodepool
22:36 thcipriani: I don't have permission to restart nodepool
22:35 thcipriani: restarting nodepool
22:18 matt_flaschen: Deployed https://gerrit.wikimedia.org/r/#/c/282440/ to switch Beta Cluster to use External Store for new testwiki writes
21:00 hashar: thcipriani downgraded git plugins successfully (we wanted to rule out their upgrade for some weird issue)
20:13 cscott: updated OCG to version e39e06570083877d5498da577758cf8d162c1af4
14:10 hashar: restarting Jenkins
14:09 hashar: Jenkins upgrading credential plugin 1.24 > 1.27 And Credentials binding plugin 1.6 > 1.7
14:07 hashar: Jenkins upgrading git plugin 2.4.1 > 2.4.4
14:01 hashar: Jenkins upgrading git client plugin 1.19.1 > 1.19.6
13:13 jzerebecki: reloading zuul for 81a1f1a..0993349
11:43 hashar: fixed puppet on deployment-cache-text04 T132689
10:38 hashar: Rebuild Image ci-trusty-wikimedia-1461753210 in wmflabs-eqiad is ready
09:43 hashar: tmh01.deployment-prep.eqiad.wmflabs denies mwdeploy user breaking https://integration.wikimedia.org/ci/job/beta-scap-eqiad/

2016-04-26

20:45 hashar: Regenerating Nodepool Jessie snapshot to include composer and HHVM | T128092
20:23 jzerebecki: reloading zuul for eb480d8..81a1f1a
19:25 jzerebecki: reload zuul for 4675213..eb480d8
19:25 jzerebecki: 4675213..eb480d8
14:18 hashar: Applied security patches to 1.27.0-wmf.22 | T131556
12:39 hashar: starting cut of 1.27.0-wmf.22 branch ( poke ostriches )
10:29 hashar: restored integration/phpunit on CI slaves due to https://integration.wikimedia.org/ci/job/operations-mw-config-phpunit/ failing
09:11 hashar: CI is back up!
08:20 hashar: shutoff instance castor, does not seem to be able to start again :( | T133652
08:12 hashar: hard rebooting castor instance | T133652
08:10 hashar: soft rebooting castor instance | T133652
08:06 hashar: CI jobs deadlocked due to castor being unavailable | https://phabricator.wikimedia.org/T133652
00:46 thcipriani: temporary keyholder fix in place in beta
00:18 thcipriani: beta-scap-eqiad failure due to bad keyholder-auth.d fingerprints

2016-04-25

20:58 cscott: updated OCG to version 58a720508deb368abfb7652e6a8c7225f95402d2
19:46 hashar: Nodepool now has a couple trusty instances intended to experiment with Zend 5.5 / HHVM migration . https://phabricator.wikimedia.org/T133203#2236625
13:34 hashar: Nodepool is attempting to create a Trusty snapshot with name ci-trusty-wikimedia-1461591203 | T133203
13:15 hashar: openstack image create --file /home/hashar/image-trusty-20160425T124552Z.qcow2 ci-trusty-wikimedia --disk-format qcow2 --property show=true # T133203
10:38 hashar: Refreshing Nodepool Jessie snapshot based on new image
10:35 hashar: Refreshed Nodepool Jessie image ( image-jessie-20160425T100035Z )
09:24 hashar: beta / scap failure filed as T133521
09:20 hashar: Keyholder / mwdeploy ssh keys have been messed up on beta cluster somehow :-(
08:47 hashar: mwdeploy@deployment-tin has lost ssh host keys file :(

2016-04-24

17:14 jzerebecki: reloading e06f1fe..672fc84

2016-04-22

18:13 legoktm: deploying https://gerrit.wikimedia.org/r/284841
08:13 legoktm: deploying https://gerrit.wikimedia.org/r/284860

2016-04-21

19:07 thcipriani: scap version testing should be done, puppet should no longer be disabled on hosts
18:02 thcipriani: disabling puppet on scap targets to test scap_3.1.0-1+0~20160421173204.70~1.gbp6706e0_all.deb

2016-04-20

22:28 thcipriani: rolling back scap version in beta, legit failure :(
21:52 thcipriani: testing new scap version in beta on deployment-tin
17:54 thcipriani: Reloading Zuul to deploy gerrit:284494
13:58 hashar: Stopping HHVM on CI slaves by cherry picking a couple puppet patches | T126594
13:33 hashar: salt -v '*trusty*' cmd.run 'rm /usr/lib/x86_64-linux-gnu/hhvm/extensions/current' # Cleanup on CI slaves for T126658
13:27 hashar: Restarted integration puppet master service (out of memory / mem leak)

2016-04-17

01:01 legoktm: deploying https://gerrit.wikimedia.org/r/283837

2016-04-16

14:21 Krenair: restarted qa-morebots per request
14:18 Krenair: <jzerebecki> !log reloading zuul for 3f64dbd..c6411a1

2016-04-13

01:48 legoktm: deploying https://gerrit.wikimedia.org/r/282952

2016-04-12

19:47 bd808: Cleaned up large hhbc cache file on deployment-mediawiki03 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start`
19:47 bd808: Cleaned up large hhbc cache file on deployment-mediawiki02 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start`
19:46 bd808: Cleaned up large hhbc cache file on deployment-mediawiki01 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start`
19:10 Amir1: manually rebooted deployment-ores-web
19:08 Amir1: manually cherry-picked 282992/2 into puppetmaster
17:05 Amir1: ran puppet agent on sca01 manually in /srv directory
11:34 hashar: Jenkins upgrading "Script Security Plugin" from 1.17 to 1.18.1 https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2016-04-11

2016-04-11

21:23 csteipp: deployed and reverted oath
20:30 thcipriani: relaunched slave-agent on integration-slave-trusty-1025, back online
20:19 thcipriani: integration-slave-trusty-1025 horizon console filled with INFO: task jbd2/vda1-8:170 blocked for more than 120 seconds. rebooting
20:13 thcipriani: killing stuck jobs, marking integration-slave-trusty-1025 as offline temporarily
14:42 thcipriani: deployment-mediawiki01 disk full :(

2016-04-08

22:46 matt_flaschen: Created blobs1 table for all wiki DBs on Beta Cluster
14:34 hashar: Image ci-jessie-wikimedia-1460125717 in wmflabs-eqiad is ready adds package 'unzip' | T132144
12:49 hashar: Image ci-jessie-wikimedia-1460119481 in wmflabs-eqiad is ready , adds package 'zip' | T132144
09:30 hashar: Removed label hasAndroidSdk from gallium . That prevents that slave from sometimes running the job apps-android-commons-build
08:42 hashar: Rebased puppet master and fixed conflict with https://gerrit.wikimedia.org/r/#/c/249490/

2016-04-07

20:16 hashar: deployment-mediawiki02.deployment-prep.eqiad.wmflabs , cleared up random left over stuff / big logs etc
20:08 hashar: deployment-mediawiki02.deployment-prep.eqiad.wmflabs / is full

2016-04-05

23:56 marxarelli: Removed cherry-pick and rebased /var/lib/git/operations/puppet on integration-puppetmaster after merge of https://gerrit.wikimedia.org/r/#/c/281706/
21:58 marxarelli: Restarting puppetmaster on integration-puppetmaster
21:53 marxarelli: Cherry picked https://gerrit.wikimedia.org/r/#/c/281706/ on integration-puppetmaster and applying on integration-slave-trusty-1014
10:32 hashar: gallium removing texlive
10:29 hashar: gallium removing libav / ffmpeg. No longer needed since jobs no longer run on that server

2016-04-04

17:30 greg-g: Phabricator going down in about 10 minutes to hopefully address the overheating issue: T131742
10:06 hashar: integration: salt -v '*-slave*' cmd.run 'rm /usr/local/bin/grunt; rm -fR /usr/local/lib/node_modules/grunt-cli' | T124474
10:04 hashar: integration: salt -v '*-slave*' cmd.run 'npm -g uninstall grunt-cli' | T124474
03:15 greg-g: Phabricator is down

2016-04-03

07:02 legoktm: deploying https://gerrit.wikimedia.org/r/281079
03:16 Amir1: manually rebooted deployment-ores-web and deployment-sca01

2016-04-02

22:58 Amir1: added local hack to puppetmaster to make scap3 provider more verbose
19:46 hashar: Upgrading Jenkins Gearman plugin to v2.0, which brings in diff registration for faster updates of Gearman server
14:39 Amir1: manually added 281170/5 to beta puppetmaster
14:22 Amir1: manually added 281161/1 to beta puppetmaster
11:31 Reedy: deleted archived logs older than 30 days from deployment-fluorine

2016-04-01

22:16 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/281046
21:13 hashar: Image ci-jessie-wikimedia-1459544873 in wmflabs-eqiad is ready
20:57 hashar: Refreshing Nodepool snapshot to hopefully get npm 2.x installed T124474
20:37 hashar: Added Luke081515 as a member of deployment-prep (beta cluster) labs project
20:31 hashar: Dropping grunt-cli from the permanent slaves. People can have it installed by listing it in their package.json devDependencies https://gerrit.wikimedia.org/r/#/c/280974/
14:06 hashar: integration: removed sudo policy permitting sudo as any member of the project for any member of the project, which included jenkins-deploy user
14:05 hashar: integration: removed sudo policy permitting sudo as root for any member of the project, which included jenkins-deploy user
11:23 bd808: Freed 4.5G on deployment-fluorine:/srv/mw-log by deleting wfDebug.log
04:00 Amir1: manually rebooted deployment-sca01
00:16 csteipp: created oathauth_users table on centralauth db in beta

2016-03-31

21:19 legoktm: deploying https://gerrit.wikimedia.org/r/280756
13:52 hashar: rebasing integration puppetmaster (it had some merge commit )
01:40 Krinkle: Purged npm cache in integration-slave-trusty-1015:/mnt/home/jenkins-deploy/.npm, which was corrupted around March 23 19:00 for unknown reasons (T130895)

2016-03-30

19:32 twentyafterfour: deleted some nutcracker and hhvm log files on deployment-mediawiki01 to free space
15:37 hashar: Gerrit has trouble sending emails T131189
13:48 Reedy: deployment-prep Make that deployment-tmh01
13:48 Reedy: deployment-prep upgrade hhvm on deployment-mediawiki01 and reboot
13:35 Reedy: deployment-prep upgrade hhvm on deployment-mediawiki03 and reboot
12:16 gehel: deployment-prep restarting varnish on deployment-cache-text04
11:04 Amir1: cherry-picked 280413/1 in beta puppetmaster, manually running puppet agent in deployment-ores-web
10:22 Amir1: cherry-picking 280403 to beta puppetmaster and manually running puppet agent in deployment-ores-web

2016-03-29

23:22 marxarelli: running jenkins-jobs update config/ 'mwext-donationinterfacecore125-testextension-zend53' to deploy https://gerrit.wikimedia.org/r/#/c/280261/
19:52 Amir1: manually updated puppetmaster, deleted SSL cert key in deployment-ores-web in VM, running puppet agent manually
02:20 jzerebecki: reloading zuul for 46923c8..c0937ee

2016-03-26

22:38 jzerebecki: reloading zuul for 2d7e050..46923c8

2016-03-25

23:55 marxarelli: deleting instances integration-slave-trusty-1002 and integration-slave-trusty-1005
23:54 marxarelli: deleting jenkins nodes integration-slave-trusty-1002 and integration-slave-trusty-1005
23:41 marxarelli: completed rolling manual deploy of https://gerrit.wikimedia.org/r/#/c/279640/ to trusty slaves
23:27 marxarelli: starting rolling offline/remount/online of trusty slaves to increase tmpfs size
23:22 marxarelli: pooled new trusty slaves integration-slave-trusty-1024 and integration-slave-trusty-1025
23:13 jzerebecki: reloading zuul for 0aec21d..2d7e050
22:14 marxarelli: creating new jenkins node for integration-slave-trusty-1024
22:11 marxarelli: rebooting integration-slave-trusty-{1024,1025} before pooling as replacements for trusty-1002 and trusty-1005
21:06 marxarelli: repooling integration-slave-trusty-{1005,1002} to help with load while replacement instances are provisioning
16:59 marxarelli: depooling integration-slave-trusty-1002 until DNS resolution can be resolved. still investigating disk space issue

2016-03-24

16:39 thcipriani: restarted rsync service on deployment-tin
13:45 thcipriani|afk: rearmed keyholder on deployment-tin
04:41 Krinkle: beta-update-databases-eqiad and beta-scap-eqiad stuck for over 8 hours (IRC notifier plugin deadlock)
03:28 Krinkle: beta-mediawiki-config-update-eqiad queued has been stuck for over 5 hours.

2016-03-23

23:00 Krinkle: rm -rf integration-slave-trusty-1013:/mnt/home/jenkins-deploy/tmpfs/jenkins-2/karma-54925082/ (bad permissions, caused Karma issues)
19:02 legoktm: restarted zuul

2016-03-22

17:40 legoktm: deploying https://gerrit.wikimedia.org/r/278926

2016-03-21

21:55 hashar: zuul: almost all MediaWiki extensions migrated to run the npm job on Nodepool (with Node.js 4.3) T119143 . All tested. Will check the overnight build results tomorrow
20:28 hashar: Mass running npm-node-4.3 jobs against MediaWiki extensions to make sure they all pass ( https://gerrit.wikimedia.org/r/#/c/278004/ | T119143 )
17:40 elukey: executed git rebase --interactive on deployment-puppetmaster.deployment-prep.eqiad.wmflabs to remove https://gerrit.wikimedia.org/r/#/c/278713/
15:46 elukey: manually hacked the cdh puppet submodule on deployment-puppetmaster.deployment-prep.eqiad.wmflabs - please let me know if it interferes with anybody's tests
14:24 elukey: executed git submodule update --init on deployment-puppetmaster.deployment-prep.eqiad.wmflabs
11:25 elukey: beta: cherry picked https://gerrit.wikimedia.org/r/#/c/278713/ to test an update to the cdh module (analytics)
11:13 hashar: beta: rebased puppet master which had a conflict on https://gerrit.wikimedia.org/r/#/c/274711/ which got merged meanwhile (saves Elukey )
11:02 hashar: beta: added Elukey (wikimedia ops) to the project as member and admin

2016-03-19

13:04 hashar: Jenkins: added ldap-labs-codfw.wikimedia.org as a fallback LDAP server T130446

2016-03-18

17:16 jzerebecki: reloading zuul for e33494f..89a9659

2016-03-17

21:10 thcipriani: updating scap on deployment-tin to test D133
18:31 cscott: updated OCG to version c1a8232594fe846bd2374efd8f7c20d7e97ac449
09:34 hashar: deployment-jobrunner01 deleted /var/log/apache/*.gz T130179
09:04 hashar: Upgrading hhvm and related extensions on jobrunner01 T130179

2016-03-16

14:28 hashar: Updated jobs having the package manager cache system (castor) via https://gerrit.wikimedia.org/r/#/c/277774/

2016-03-15

15:17 jzerebecki: added wikidata.beta.wmflabs.org in https://wikitech.wikimedia.org/wiki/Special:NovaAddress to deployment-cache-text04.deployment-prep.eqiad.wmflabs
14:19 hashar: Image ci-jessie-wikimedia-1458051246 in wmflabs-eqiad is ready T124447
14:14 hashar: Refreshing Nodepool snapshot images so they get a fresh copy of slave-scripts T124447
14:08 hashar: Deploying slave script change https://gerrit.wikimedia.org/r/#/c/277508/ "npm-install-dev.py: Use config.dev.yaml instead of config.yaml" for T124447

2016-03-14

22:18 greg-g: new jobs weren't processing in Zuul, lego fixed it and blamed Reedy
20:13 hashar: Updating Jenkins jobs mwext-Wikibase-* so they no longer rely on --with-phpunit ( ping @hoo https://gerrit.wikimedia.org/r/#/c/277330/ )
17:03 Krinkle: Doing full Zuul restart due to deadlock (T128569)
10:18 moritzm: re-enabled systemd unit for logstash on deployment-logstash2

2016-03-11

22:42 legoktm: deploying https://gerrit.wikimedia.org/r/276901
19:41 legoktm: legoktm@integration-slave-trusty-1001:/mnt/jenkins-workspace/workspace$ sudo rm -rf mwext-Echo-testextension-* # because it was broken

2016-03-10

20:22 hashar: Nodepool Image ci-jessie-wikimedia-1457641052 in wmflabs-eqiad is ready
20:19 hashar: Refreshing Nodepool to include the 'varnish' package T128188
20:05 hashar: apt-get upgrade integration-slave-jessie1001 (bring in ffmpeg update and nodejs among other things)
12:22 hashar: Nodepool Image ci-jessie-wikimedia-1457612269 in wmflabs-eqiad is ready
12:18 hashar: Nodepool: rebuilding image to get mathoid/graphoid packages included (hopefully) T119693 T128280

2016-03-09

17:56 bd808: Cleaned up git clone state in deployment-tin.deployment-prep:/srv/mediawiki-staging/php-master and queued beta-code-update-eqiad to try again (T129371)
17:48 bd808: Git clone at deployment-tin.deployment-prep:/srv/mediawiki-staging/php-master in completely horrible state. Investigating
17:22 bd808: Fixed https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/4452/
17:19 bd808: Manually cleaning up broken rebase in deployment-tin.deployment-prep:/srv/mediawiki-staging
16:27 bd808: Removed cherry-pick of https://gerrit.wikimedia.org/r/#/c/274696 ; manually cleaned up systemd unit and restarted logstash on deployment-logstash2
14:59 hashar: Image ci-jessie-wikimedia-1457535250 in wmflabs-eqiad is ready T129345
14:57 hashar: Rebuilding snapshot image to get Xvfb enabled at boot time T129345
13:04 moritzm: cherrypicked patch to deployment-prep which provides a systemd unit for logstash
10:52 hashar: Image ci-jessie-wikimedia-1457520493 in wmflabs-eqiad is ready
10:29 hashar: Nodepool: created new image and refreshing snapshot in attempt to get Xvfb running T129320 T128090

2016-03-08

23:42 legoktm: running CentralAuth's checkLocalUser.php --verbose=1 --delete=1 on deployment-tin for T115198
21:33 hashar: Nodepool Image ci-jessie-wikimedia-1457472606 in wmflabs-eqiad is ready
19:23 hashar: Zuul inject DISPLAY https://gerrit.wikimedia.org/r/#/c/273269/
16:03 hashar: Image ci-jessie-wikimedia-1457452766 is ready T128090
15:59 hashar: Nodepool: refreshing snapshot image to ship browsers+Xvfb for T128090
14:27 hashar: Mass refreshed CI slave-scripts 1d2c60d..e27c292
13:38 hashar: Rebased integration puppet master. Dropped a make-wmf-branch patch and the one for raita role
11:26 hashar: Nodepool: created new snapshot to set puppet $::labsproject : ci-jessie-wikimedia-1457436175 hoping to fix hiera lookup T129092
02:51 ori: deployment-prep Updating HHVM on deployment-mediawiki01
02:27 ori: deployment-prep Updating HHVM on deployment-mediawiki02
01:50 Krinkle: integration-saltmaster: salt -v '*slave-trusty*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm/src/skins/BlueSky' (T117710)
01:50 Krinkle: integration-saltmaster: salt -v '*slave-trusty*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm-composer/src/skins/BlueSky'

2016-03-07

21:03 hashar: Nodepool upgraded to 0.1.1-wmf.4 , it no longer waits 1 minute before deleting a used node | T118573
20:05 hashar: Upgrading Nodepool from 0.1.1-wmf3 to 0.1.1-wmf.4 with andrewbogott | T118573

2016-03-06

10:20 legoktm: deploying https://gerrit.wikimedia.org/r/274911

2016-03-04

19:31 hashar: Nodepool Image ci-jessie-wikimedia-1457119603 in wmflabs-eqiad is ready - T128846
13:29 hashar: Nodepool Image ci-jessie-wikimedia-1457097785 in wmflabs-eqiad is ready
08:42 hashar: CI deleting integration-slave-precise-1001 (2 executors). It is not in labs DNS, which causes a bunch of issues; no need for the capacity anymore. T128802
02:49 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/274889
00:11 Krinkle: salt -v --show-timeout '*slave*' cmd.run "bash -c 'cd /srv/deployment/integration/slave-scripts; git pull'"

2016-03-03

23:37 legoktm: salt -v --show-timeout '*slave*' cmd.run "bash -c 'cd /srv/deployment/integration/slave-scripts; git pull'"
22:34 legoktm: mysql not running on integration-slave-precise-1002, manually starting (T109704)
22:30 legoktm: mysql not running on integration-slave-precise-1011, manually starting (T109704)
22:19 legoktm: mysql not running on integration-slave-precise-1012, manually starting (T109704)
22:07 legoktm: deploying https://gerrit.wikimedia.org/r/274821
21:58 Krinkle: Reloading Zuul to deploy (EventLogging and AdminLinks) https://gerrit.wikimedia.org/r/274821 /
18:49 thcipriani: killing deployment-bastion since it is no longer used
14:23 hashar: https://integration.wikimedia.org/ci/computer/integration-slave-trusty-1011/ is out of disk space

2016-03-02

16:22 jzerebecki: reloading zuul for 9398fa1..943f17b
10:38 hashar: Zuul should no longer be caught in a death loop due to Depends-On on an event-schemas change. Hole filled with https://gerrit.wikimedia.org/r/#/c/274356/ T128569
08:53 hashar: gerrit set-account Jsahleen --inactive T108854
01:19 thcipriani: force restarting zuul because the queue is very stuck https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Restart
01:13 thcipriani: following steps for gearman deadlock: https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues

2016-03-01

23:10 Krinkle: Updated Jenkins configuration to also support php5 and hhvm for Console Sections detection of "PHPUnit"
17:05 hashar: gerrit: set accounts inactive for Eloquence and Mgrover. Former employees of wmf and mail bounceback
16:41 hashar: Restarted Jenkins
16:32 hashar: A bunch of Jenkins jobs got stalled because I killed threads in Jenkins to unblock integration-slave-trusty-1003 :-(
12:14 hashar: integration-slave-trusty-1003 is back online
12:13 hashar: Might have killed the proper Jenkins thread to unlock integration-slave-trusty-1003
12:03 hashar: Jenkins can not pool back integration-slave-trusty-1003. Jenkins master has a bunch of blocking threads piling up with hudson.plugins.sshslaves.SSHLauncher.afterDisconnect() locked somehow
11:41 hashar: Rebooting integration-slave-trusty-1003 (does not reply to salt / ssh)
10:34 hashar: Image ci-jessie-wikimedia-1456827861 in wmflabs-eqiad is ready
10:24 hashar: Refreshing Nodepool snapshot instances
10:22 hashar: Refreshing Nodepool base image to speed instances boot time (dropping open-iscsi package https://gerrit.wikimedia.org/r/#/c/273973/ )

2016-02-29

16:23 hashar: salt -v '*slave*' cmd.run 'rm -fR /mnt/jenkins-workspace/workspace/mwext*jslint' T127362
16:17 hashar: Deleting all mwext-.*-jslint jobs from Jenkins. Paladox has migrated all of them to jshint/jsonlint generic jobs T127362
09:46 hashar: Jenkins installing Yaml Axis Plugin 0.2.0

2016-02-28

01:30 Krinkle: Rebooting integration-slave-precise-1012 – Might help T109704 (MySQL not running)

2016-02-26

15:14 jzerebecki: salt -v --show-timeout '*slave*' cmd.run "bash -c 'cd /srv/deployment/integration/slave-scripts; git pull'" T128191
14:44 hashar: (since it started, don't be that scared!)
14:44 hashar: Nodepool has triggered 40 000 instances
11:53 hashar: Restarted memcached on deployment-memc02 T128177
11:53 hashar: memcached process on deployment-memc02 seems to have a nice leak of socket usages (from lost) and plainly refuse connections (bunch of CLOSE_WAIT) T128177
11:40 hashar: deployment-memc04 find /etc/apt -name '*proxy' -delete (prevented apt-get update)
11:26 hashar: beta: salt -v '*' cmd.run 'apt-get -y install ruby-msgpack' . I am tired of seeing puppet debug messages: "Debug: Failed to load library 'msgpack' for feature 'msgpack'"
11:24 hashar: puppet keeps restarting nutcracker apparently T128177
11:20 hashar: Memcached error for key "enwiki:flow_workflow%3Av2%3Apk:63dc3cf6a7184c32477496d63c173f9c:4.8" on server "127.0.0.1:11212": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY

2016-02-25

22:38 hashar: beta: maybe deployment-jobrunner01 is processing jobs a bit faster now. Seems like hhvm went wild
22:23 hashar: beta: jobrunner01 had apache/hhvm killed somehow .... Blame me
21:56 hashar: beta: stopped jobchron / jobrunner on deployment-jobrunner01 and restarting them by running puppet
21:49 hashar: beta did a git-deploy of jobrunner/jobrunner hoping to fix puppet run on deployment-jobrunner01 and apparently it did! T126846
11:21 hashar: deleting workspace /mnt/jenkins-workspace/workspace/browsertests-Wikidata-WikidataTests-linux-firefox-sauce on slave-trusty-1015
10:08 hashar: Jenkins upgraded T128006
01:44 legoktm: deploying https://gerrit.wikimedia.org/r/273170
01:39 legoktm: deploying https://gerrit.wikimedia.org/r/272955 (undeployed) and https://gerrit.wikimedia.org/r/273136
01:37 legoktm: deploying https://gerrit.wikimedia.org/r/273136
00:31 thcipriani: running puppet on beta to update scap to latest packaged version: sudo salt -b '10%' -G 'deployment_target:scap/scap' cmd.run 'puppet agent -t'
00:20 thcipriani: deployment-tin not accepting jobs for some time, ran through https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update, is back now

2016-02-24

19:55 legoktm: legoktm@deployment-tin:~$ mwscript extensions/ORES/maintenance/PopulateDatabase.php --wiki=enwiki
18:30 bd808: "configuration file '/etc/nutcracker/nutcracker.yml' syntax is invalid"
18:27 bd808: nutcracker dead on mediawiki01; investigating
17:20 hashar: Deleted Nodepool instances so new ones get to use the new snapshot ci-jessie-wikimedia-1456333979
17:12 hashar: Refreshing nodepool snapshot. Been stalled since Feb 15th T127755
17:01 bd808: https://wmflabs.org/sal/releng missing SAL data since 2016-02-20T20:19 due to bot crash; needs to be backfilled from wikitech data (T127981)
16:43 hashar: sal on elastic search is stalled https://phabricator.wikimedia.org/T127981
15:07 hasharAW: beta app servers have lost access to memcached due to bad nutcracker conf | T127966
14:41 hashar: beta: we have lost a memcached server at 11:51am UTC

2016-02-23

22:45 thcipriani: deployment-puppetmaster is in a weird rebase state
22:25 legoktm: running sync-common manually on deployment-mediawiki02
09:59 hashar: Deleted a bunch of mwext-.*-jslint jobs that are no longer in use (migrated to either 'npm' or 'jshint' / 'jsonlint' )

2016-02-22

22:06 bd808: Restarted puppetmaster service on deployment-puppetmaster to "fix" error "invalid byte sequence in US-ASCII"
17:46 jzerebecki: ssh integration-slave-trusty-1017.eqiad.wmflabs 'sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm/src/.git/config.lock'
16:47 gehel: deployment-prep upgrading deployment-logstash2 to elasticsearch 1.7.5
10:26 gehel: deployment-prep upgrading elastic-search to 1.7.5 on deployment-elastic0[5-8]

2016-02-20

20:19 Krinkle: beta-code-update-eqiad job repeatedly stuck at "IRC notifier plugin"
19:29 Krinkle: beta-code-update-eqiad broken because deployment-tin:/srv/mediawiki-staging/php-master/extensions/MobileFrontend/includes/MobileFrontend.hooks.php was modified on the server without commit
19:22 Krinkle: Various beta-mediawiki-config-update-eqiad jobs have been stuck 'queued' for > 24 hours

2016-02-19

12:09 hashar: killed https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ been running for 13 hours. Blocked because slave went offline due to labs reboots yesterday
10:15 hashar: Creating a bunch of repositories in GitHub to fix Gerrit replication errors

2016-02-18

19:20 legoktm: deploying https://gerrit.wikimedia.org/r/271583 and https://gerrit.wikimedia.org/r/271581, both no-ops
18:14 legoktm: deploying https://gerrit.wikimedia.org/r/271012
17:36 legoktm: deploying https://gerrit.wikimedia.org/r/271555
16:01 hashar: deleting instance integration-slave-precise-1003 think we have enough precise slaves
10:44 hashar: Nodepool: JenkinsException: Could not parse JSON info for server[1]

2016-02-17

07:36 legoktm: deploying https://gerrit.wikimedia.org/r/271201
01:01 yuvipanda: attempting to turn off NFS on 52 instances on deployment-prep project

2016-02-16

23:22 yuvipanda: new instances on deployment-prep no longer get NFS because of https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ADeployment-prep&type=revision&diff=311783&oldid=311781
23:18 hashar: jenkins@gallium find /var/lib/jenkins/config-history/nodes -maxdepth 1 -type d -name 'ci-jessie*' -exec rm -vfR {} \;
23:17 hashar: Jenkins accepting slave creations again. Root cause is /var/lib/jenkins/config-history/nodes/ has reached the 32k inode limit.
23:14 hashar: Jenkins: Could not create rootDir /var/lib/jenkins/config-history/nodes/ci-jessie-wikimedia-34969/2016-02-16_22-40-23
23:02 hashar: Nodepool can not authenticate with Jenkins anymore. Thus it can not add slaves it spawned.
22:56 hashar: contint: Nodepool instances pool exhausted
21:14 andrewbogott: deployment-logstash2 migration finished
20:49 jzerebecki: reloading zuul for 3bf7584..67fec7b
19:58 andrewbogott: migrating deployment-logstash2 to labvirt1010
19:00 hashar: tin: checking out mw 1.27.0-wmf.14
15:23 hashar: integration-make-wmfbranch : /mnt/make-wmf-branch mount now has gid=wikidev and group setuid (i.e. mode 2775)
15:20 hashar: integration-make-wmfbranch : change tmpfs to /mnt/make-wmf-branch (from /var/make-wmf-branch )
11:30 jzerebecki: T117710 integration-saltmaster:~# salt -v '*slave-trusty*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm-composer/src/skins/BlueSky'
09:52 hashar: will cut the wmf branches this afternoon starting around 14:00 CET
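
The `find ... -exec rm -vfR {} \;` cleanup logged at 23:18 deletes the matching directories immediately. Before running a destructive command of that kind, it can help to list what would match. The following is a minimal dry-run sketch in Python; the helper name `stale_node_dirs` is hypothetical and is not part of Jenkins or of the tooling used in the log.

```python
from pathlib import Path


def stale_node_dirs(root, pattern="ci-jessie*"):
    """List direct subdirectories of root whose names match pattern.

    Roughly mirrors `find ROOT -maxdepth 1 -type d -name PATTERN`,
    but only reports the matches and deletes nothing.
    """
    return sorted(
        p for p in Path(root).glob(pattern)
        if p.is_dir() and not p.is_symlink()
    )


if __name__ == "__main__":
    # Path taken from the log entry; adjust for the host being inspected.
    root = "/var/lib/jenkins/config-history/nodes"
    matches = stale_node_dirs(root)
    print(f"{len(matches)} directories match under {root}")
```

Printing the count first also gives a quick check of how close the directory is to the limit described in the 23:17 entry.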

2016-02-15

16:28 jzerebecki: reloading zuul for 2d16ad3..3bb0afa
16:10 hashar: Image ci-jessie-wikimedia-1455552377 in wmflabs-eqiad is ready
15:25 jzerebecki: reloading zuul for e174335..2d16ad3
15:23 hashar: Image ci-jessie-wikimedia-1455549539 in wmflabs-eqiad is ready
15:19 hashar: Regenerating Nodepool snapshot. Slave scripts have 0 bytes...
15:04 hashar: Slave scripts added to Nodepool instances! Image ci-jessie-wikimedia-1455548346 in wmflabs-eqiad is ready
11:05 hashar: Image ci-jessie-wikimedia-1455534001 in wmflabs-eqiad is ready
07:52 legoktm: deploying https://gerrit.wikimedia.org/r/270686
06:52 legoktm: legoktm@gallium:/srv/org/wikimedia/doc$ sudo -u jenkins-slave rm -rf EventLogging/ GuidedTour/ MultimediaViewer/ TemplateData/
06:22 legoktm: deploying https://gerrit.wikimedia.org/r/270677
06:12 legoktm: deploying https://gerrit.wikimedia.org/r/270675
06:02 legoktm: deploying https://gerrit.wikimedia.org/r/270674
05:56 legoktm: deploying https://gerrit.wikimedia.org/r/270673
05:32 legoktm: deploying https://gerrit.wikimedia.org/r/270670
04:05 legoktm: deploying https://gerrit.wikimedia.org/r/270667
03:26 legoktm: deploying https://gerrit.wikimedia.org/r/270665
02:56 legoktm: deploying https://gerrit.wikimedia.org/r/270657

2016-02-14

23:54 legoktm: deploying https://gerrit.wikimedia.org/r/270656
23:25 legoktm: deploying https://gerrit.wikimedia.org/r/270654
23:13 legoktm: also deploying https://gerrit.wikimedia.org/r/#/c/265098/
23:11 legoktm: deploying https://gerrit.wikimedia.org/r/270651
05:18 bd808: tools.stashbot Testing after restart (T126419)

2016-02-13

06:42 bd808: restarted nutcracker on deployment-mediawiki01
06:32 bd808: jobrunner on deployment-jobrunner01 enabled after reverting changes from T87928 that caused T126830
05:51 bd808: disabled jobrunner process on jobrunner01; queue full of jobs broken by T126830
05:31 bd808: trebuchet clone of /srv/jobrunner/jobrunner broken on jobrunner01; failing puppet runs
05:25 bd808: jobrunner process on deployment-jobrunner01 badly broken; investigating
05:20 bd808: Ran https://phabricator.wikimedia.org/P2273 on deployment-jobrunner01.deployment-prep.eqiad.wmflabs; freed ~500M; disk utilization still at 94%

2016-02-12

23:54 hashar: beta cluster broken since 20:30 UTC https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/fatalmonitor haven't looked
17:36 hashar: salt -v '*slave-trusty*' cmd.run 'apt-get -y install texlive-generic-extra' # T126422
17:32 hashar: adding texlive-generic-extra on CI slaves by cherry picking https://gerrit.wikimedia.org/r/#/c/270322/ - T126422
17:19 hashar: get rid of integration-dev it is broken somehow
17:10 hashar: Nodepool back at spawning instances. contintcloud has been migrated in wmflabs
16:51 thcipriani: running sudo salt '*' -b '10%' deploy.fixurl to fix deployment-prep trebuchet urls
16:31 hashar: bd808 added support for saltbot to update tasks automagically!!!! T108720
03:10 yurik: attempted to sync graphoid from gerrit 270166 from deployment-tin, but it wouldn't sync. Tried to git pull sca02, submodules wouldn't pull

2016-02-11

22:53 thcipriani: shutting down deployment-bastion
21:28 hashar: pooling back slaves 1001 to 1006
21:18 hashar: re enabling hhvm service on slaves ( https://phabricator.wikimedia.org/T126594 ) Some symlink is missing and only provided by the upstart script grrrrrrr https://phabricator.wikimedia.org/T126658
20:52 legoktm: deploying https://gerrit.wikimedia.org/r/270098
20:35 hashar: depooling the six recent slaves: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so cannot open shared object file
20:29 hashar: pooling integration-slave-trusty-1004 integration-slave-trusty-1005 integration-slave-trusty-1006
20:14 hashar: pooling integration-slave-trusty-1001 integration-slave-trusty-1002 integration-slave-trusty-1003
19:35 marxarelli: modifying deployment server node in jenkins to point to deployment-tin
19:27 thcipriani: running sudo salt -b '10%' '*' cmd.run 'puppet agent -t' from deployment-salt
19:27 twentyafterfour: Keeping notes on the ticket: https://phabricator.wikimedia.org/T126537
19:24 thcipriani: moving deployment-bastion to deployment-tin
17:59 hashar: recreated instances with proper names: integration-slave-trusty-{1001-1006}
17:52 hashar: Created integration-slave-trusty-{1019-1026} as m1.large (note 1023 is an exception it is for Android). Applied role::ci::slave , lets wait for puppet to finish
17:42 Krinkle: Currently testing https://gerrit.wikimedia.org/r/#/c/268802/ in Beta Labs
17:27 hashar: Depooling all the ci.medium slaves and deleting them.
17:27 hashar: I tried. The ci.medium instances are too small and MediaWiki tests really need 1.5GBytes of memory :-(
16:00 hashar: rebuilding integration-dev https://phabricator.wikimedia.org/T126613
15:27 Krinkle: Deploy Zuul config change https://gerrit.wikimedia.org/r/269976
11:46 hashar: salt -v '*' cmd.run '/etc/init.d/apache2 restart' might help for Wikidata browser tests failing
11:32 hashar: disabling hhvm service on CI slaves ( https://phabricator.wikimedia.org/T126594 , cherry picked both patches )
10:50 hashar: reenabled puppet on CI. All transitioned to a 128MB tmpfs (was 512MB)
10:16 hashar: pooling back integration-slave-trusty-1009 and integration-slave-trusty-1010 (tmpfs shrunken)
10:06 hashar: disabling puppet on all CI slaves. Trying to lower tmpfs 512MB to 128MB ( https://gerrit.wikimedia.org/r/#/c/269880/ )
02:45 legoktm: deploying https://gerrit.wikimedia.org/r/269853 https://gerrit.wikimedia.org/r/269893

2016-02-10

23:54 hashar_: depooling Trusty slaves that only have 2GB of ram that is not enough. https://phabricator.wikimedia.org/T126545
22:55 hashar_: gallium: find /var/lib/jenkins/config-history/config -type f -wholename '*/2015*' -delete ( https://phabricator.wikimedia.org/T126552 )
22:34 Krinkle: Zuul is back up and processing Gerrit events, but jobs are still queued indefinitely. Jenkins is not accepting new jobs
22:31 Krinkle: Full restart of Zuul. Seems Gearman/Zuul got stuck. All executors were idling. No new Gerrit events processed either.
21:22 legoktm: cherry-picking https://gerrit.wikimedia.org/r/#/c/269370/ on integration-puppetmaster again
21:17 hashar: CI dust has settled. Krinkle and I have pooled a lot more Trusty slaves to accommodate the overload caused by switching to php55 (jobs run on Trusty)
21:08 hashar: pooling trusty slaves 1009, 1010, 1021, 1022 with 2 executors (they are ci.medium)
20:38 hashar: cancelling mediawiki-core-jsduck-publish and mediawiki-core-doxygen-publish jobs manually. They will catch up on next merge
20:34 Krinkle: Pooled integration-slave-trusty-1019 (new)
20:28 Krinkle: Pooled integration-slave-trusty-1020 (new)
20:24 Krinkle: created integration-slave-trusty-1019 and integration-slave-trusty-1020 (ci1.medium)
20:18 hashar: created integration-slave-trusty-1009 and 1010 (trusty ci.medium)
20:06 hashar: creating integration-slave-trusty-1021 and integration-slave-trusty-1022 (ci.medium)
19:48 greg-g: that cleanup was done by apergos
19:48 greg-g: did cleanup across all integration slaves, some were very close to out of room. results: https://phabricator.wikimedia.org/P2587
19:43 hashar: Dropping slaves Precise m1.large integration-slave-precise-1014 and integration-slave-precise-1013 , most load shifted to Trusty (php53 -> php55 transition)
18:20 Krinkle: Creating a Trusty slave to support increased demand following MediaWiki php53(precise)>php55(trusty) bump
16:06 jzerebecki: reloading zuul for 41a92d5..5b971d1
15:42 jzerebecki: reloading zuul for 639dd40..41a92d5
14:12 jzerebecki: recover a bit of disk space: integration-saltmaster:~# salt --show-timeout '*slave*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/*WikibaseQuality*'
13:46 jzerebecki: reloading zuul for 639dd40
13:15 jzerebecki: reloading zuul for 3be81c1..e8e0615
08:07 legoktm: deploying https://gerrit.wikimedia.org/r/269619
08:03 legoktm: deploying https://gerrit.wikimedia.org/r/269613 and https://gerrit.wikimedia.org/r/269618
06:41 legoktm: deploying https://gerrit.wikimedia.org/r/269607
06:34 legoktm: deploying https://gerrit.wikimedia.org/r/269605
02:59 legoktm: deleting 14GB broken workspace of mediawiki-core-php53lint from integration-slave-precise-1004
02:37 legoktm: deleting /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm-composer on trusty-1017, it had a skin cloned into it
02:26 legoktm: queuing mwext jobs server-side to identify failing ones
02:21 legoktm: deploying https://gerrit.wikimedia.org/r/269582
01:03 legoktm: deploying https://gerrit.wikimedia.org/r/269576

2016-02-09

23:17 legoktm: deploying https://gerrit.wikimedia.org/r/269551
23:02 legoktm: gracefully restarting zuul
22:57 legoktm: deploying https://gerrit.wikimedia.org/r/269547
22:29 legoktm: deploying https://gerrit.wikimedia.org/r/269540
22:18 legoktm: re-enabling puppet on all CI slaves
22:02 legoktm: reloading zuul to see if it'll pickup the new composer-php53 job
21:53 legoktm: enabling puppet on just integration-slave-trusty-1012
21:52 legoktm: cherry-picked https://gerrit.wikimedia.org/r/#/c/269370/ onto integration-puppetmaster
21:50 legoktm: disabling puppet on all trusty/precise CI slaves
21:40 legoktm: deploying https://gerrit.wikimedia.org/r/269533
17:49 marxarelli: disabled/enabled gearman in jenkins, connection works this time
17:49 marxarelli: performed stop/start of zuul on gallium to restore zuul and gearman
17:45 marxarelli: "Failed: Unable to Connect" in jenkins when testing gearman connection
17:40 marxarelli: killed old zuul process manually and restarted service
17:39 marxarelli: restart of zuul fails as well. old process cannot be killed
17:38 marxarelli: reloading zuul fails with "failed to kill 13660: Operation not permitted"
16:06 bd808: Deleted corrupt integration-slave-precise-1003:/mnt/jenkins-workspace/workspace/mediawiki-core-php53lint/.git
15:11 hashar: mira: /srv/mediawiki-staging/multiversion/checkoutMediaWiki 1.27.0-wmf.13 php-1.27.0-wmf.13
14:51 hashar: ./make-wmf-branch -n 1.27.0-wmf.13 -o master
14:50 hashar: pooling back integration-slave-precise1001 - 1004. Manually fetched git repos in workspace for mediawiki core php53
14:49 hashar: make-wmf-branch instance: created a local ssh key pair and set the config to use User: hashar
14:13 hashar: pooling https://integration.wikimedia.org/ci/computer/integration-slave-precise-1012/ Mysql is back .. Blame puppet
14:12 hashar: depooling https://integration.wikimedia.org/ci/computer/integration-slave-precise-1012/ Mysql is gone somehow
14:04 hashar: Manually git fetching mediawiki-core in /mnt/jenkins-workspace/workspace/mediawiki-core-php53lint of slaves precise 1001 to 1004 (git on Precise is remarkably too slow)
13:28 hashar: salt '*trusty*' cmd.run 'update-alternatives --set php /usr/bin/hhvm'
13:28 hashar: salt '*precise*' cmd.run 'update-alternatives --set php /usr/bin/php5'
13:18 hashar: salt -v --batch=3 '*slave*' cmd.run 'puppet agent -tv'
13:15 hashar: removing https://gerrit.wikimedia.org/r/#/c/269370/ from CI puppet master
13:14 hashar: slave recurse infinitely doing /bin/bash -eu /srv/deployment/integration/slave-scripts/bin/mw-install-mysql.sh then loop over /bin/bash /usr/bin/php maintenance/install.php --confpath /mnt/jenkins-workspace/workspace/mediawiki-core-qunit/src --dbtype=mysql --dbserver=127.0.0.1:3306 --dbuser=jenkins_u2 --dbpass=pw_jenkins_u2 --dbname=jenkins_u2_mw --pass testpass TestWiki WikiAdmin https://phabricator.wikimedia.org/T126327
12:46 hashar: Mass testing php loop of death: salt -v '*slave*' cmd.run 'timeout 2s /srv/deployment/integration/slave-scripts/bin/php --version'
12:40 hashar: mass rebooting CI slaves from wikitech
12:39 hashar: salt -v '*' cmd.run "bash -c 'cd /srv/deployment/integration/slave-scripts; git pull'"
12:33 hashar: all slaves dying due to PHP looping
12:02 legoktm: re-enabling puppet on all trusty/precise slaves
11:20 legoktm: cherry-picked https://gerrit.wikimedia.org/r/#/c/269370/ on integration-puppetmaster
11:20 legoktm: enabling puppet just on integration-slave-trusty-1012
11:13 legoktm: disabling puppet on all *(trusty|precise)* slaves
10:26 hashar: pooling in integration-slave-trusty-1018
03:19 legoktm: deploying https://gerrit.wikimedia.org/r/269359
02:53 legoktm: deploying https://gerrit.wikimedia.org/r/238988
00:39 hashar: gallium edited /usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/trigger/gerrit.py and modified: replication_timeout = 300 -> replication_timeout = 10
00:37 hashar: live hacking Zuul code to have it stop sleeping() on force merge
00:36 hashar: killing zuul

2016-02-08

23:48 legoktm: finally deploying https://gerrit.wikimedia.org/r/269327
23:14 hashar: zuul promote --pipeline gate-and-submit --changes 269065,2 https://gerrit.wikimedia.org/r/#/c/269065/
23:10 hashar: pooling integration-slave-precise-1001 1002 1004
22:47 hashar: Err need to reboot newly provisioned instances before adding them to Jenkins (kernel upgrade,apache restart etc)
22:45 hashar: Pooled https://integration.wikimedia.org/ci/computer/integration-slave-precise-1003/
22:25 hashar: integration-slave-precise-{1001-1004} applied role::ci::slave::labs, running puppet in slaves. I have added the instances as Jenkins slaves and put them offline. Whenever puppet is done, we can mark them online in Jenkins then monitor the jobs running on them are working properly
22:15 hashar: Provisioning integration-slave-precise-{1001-1004} https://phabricator.wikimedia.org/T126274 (need more php53 slots)
22:13 hashar: Deleted cache-rsync instance superseded by castor instance
22:10 hashar: Deleting pmcache.integration.eqiad.wmflabs (was to investigate various kind of central caches).
20:14 marxarelli: aborting pending mediawiki-extensions-php53 job for CheckUser
20:08 bd808: toggled "Enable Gearman" off and on in Jenkins to wake up deployment-bastion workers
14:54 hashar: nodepool: refreshed snapshot image , Image ci-jessie-wikimedia-1454942958 in wmflabs-eqiad is ready
14:47 hashar: regenerated nodepool reference image (got rid of grunt-cli https://gerrit.wikimedia.org/r/269126 )
09:41 legoktm: deploying https://gerrit.wikimedia.org/r/269093 https://gerrit.wikimedia.org/r/269094
09:36 hashar: restarting integration puppetmaster (out of memory / cannot fork)
06:11 bd808: tgr set $wgAuthenticationTokenVersion on beta cluster (test run for T124440)
02:09 legoktm[NE]: deploying https://gerrit.wikimedia.org/r/268047
00:57 legoktm[NE]: deploying https://gerrit.wikimedia.org/r/268031

2016-02-06

18:34 jzerebecki: reloading zuul for bdb2ed4..46ccca9

2016-02-05

13:30 hashar: beta cleaning out /data/project/logs/archive was from pre-logstash era. We have not logged this way since May 2015 apparently
13:29 hashar: beta deleting /data/project/swift-disk created in August 2014, unused since June 2015. Was a failed attempt at bringing swift to beta
13:27 hashar: beta: reclaiming disk space from extensions.git. On bastion: find /srv/mediawiki-staging/php-master/extensions/.git/modules -maxdepth 1 -type d -print -execdir git gc \;
13:03 hashar: integration-slave-trusty-1011 went out of disk space. Did some brute clean up and git gc.
05:21 Tim: configured mediawiki-extensions-qunit to only run on integration-slave-trusty-1017, did a rebuild and then switched it back

2016-02-04

22:08 jzerebecki: reloading zuul for bed7be1..f57b7e2
21:51 hashar: salt-key -d integration-slave-jessie-1001.eqiad.wmflabs
21:50 hashar: salt-key -d integration-slave-precise-1011.eqiad.wmflabs
00:57 bd808: Got deployment-bastion processing Jenkins jobs again via instructions left by my past self at https://phabricator.wikimedia.org/T72597#747925
00:43 bd808: Jenkins agent on deployment-bastion.eqiad doing the trick where it doesn't pick up jobs again

2016-02-03

22:24 bd808: Manually ran sync-common on deployment-jobrunner01.eqiad.wmflabs to pickup wmf-config changes that were missing (InitializeSettings, Wikibase, mobile)
17:43 marxarelli: Reloading Zuul to deploy previously undeployed Icd349069ec53980ece2ce2d8df5ee481ff44d5d0 and Ib18fe48fe771a3fe381ff4b8c7ee2afb9ebb59e4
15:12 hashar: apt-get upgrade deployment-sentry2
15:03 hashar: redeployed rcstream/rcstream on deployment-stream by using git-deploy on deployment-bastion
14:55 hashar: upgrading deployment-stream
14:42 hashar: pooled back integration-slave-trusty-1015 Seems ok
14:35 hashar: manually triggered a bunch of browser tests jobs
11:40 hashar: apt-get upgrade deployment-ms-be01 and deployment-ms-be02
11:32 hashar: fixing puppet.conf on deployment-memc04
11:09 hashar: restarting beta cluster puppetmaster just in case
11:07 hashar: beta: apt-get upgrade on deployment-cache* hosts and checking puppet
10:59 hashar: integration/beta: deleting /etc/apt/apt.conf.d/*proxy files. There is no need for them, in fact web proxy is not reachable from labs
10:53 hashar: integration: switched puppet repo back to 'production' branch, rebased.
10:49 hashar: various beta cluster have puppet errors ..
10:46 hashar: integration-slave-trusty-1013 heading to out of disk space on /mnt ...
10:42 hashar: integration-slave-trusty-1016 out of disk space on /mnt ...
03:45 bd808: Puppet failing on deployment-fluorine with "Error: Could not set uid on user[datasets]: Execution of '/usr/sbin/usermod -u 10003 datasets' returned 4: usermod: UID '10003' already exists"
03:44 bd808: Freed 28G by deleting deployment-fluorine:/srv/mw-log/archive/*2015*
03:42 bd808: Ran deployment-bastion.deployment-prep:/home/bd808/cleanup-var-crap.sh and freed 565M

2016-02-02

18:32 marxarelli: Reloading Zuul to deploy If1f3cb60f4ccb2c1bca112900dbada03a8588370
17:42 marxarelli: cleaning mwext-donationinterfacecore125-testextension-php53 workspace on integration-slave-precise-1013
17:06 ostriches: running sync-common on mw2051 and mw1119
09:38 hashar: Jenkins is fully up and operational
09:33 hashar: restarting Jenkins
08:47 hashar: pooling back integration-slave-precise1011 , puppet run got fixed ( https://phabricator.wikimedia.org/T125474 )
03:48 legoktm: deploying https://gerrit.wikimedia.org/r/267828
03:29 legoktm: deploying https://gerrit.wikimedia.org/r/266941
00:42 legoktm: due to T125474
00:42 legoktm: marked integration-slave-precise-1011 as offline
00:39 legoktm: precise-1011 slave hasn't had a puppet run in 6 days

2016-02-01

23:53 bd808: Logstash working again; I applied a change to the default mapping template for Elasticsearch that ensures that fields named "timestamp" are indexed as plain strings
23:46 bd808: Elasticsearch index template for beta logstash cluster making crappy guesses about syslog events; dropped 2016-02-01 index; trying to fix default mappings
23:09 bd808: HHVM logs causing rejections during document parse when inserting in Elasticsearch from logstash. They contain a "timestamp" field that looks like "Feb 1 22:56:39" which is making the mapper in Elasticsearch sad.
23:04 bd808: Elasticsearch on deployment-logstash2 rejecting all documents with 400 status. Investigating
22:50 bd808: Copying deployment-logstash2.deployment-prep:/var/log/logstash/logstash.log to /srv for debugging later
22:48 bd808: deployment-logstash2.deployment-prep:/var/log/logstash/logstash.log is 11G of fail!
22:46 bd808: root partition on deployment-logstash2 full
22:43 bd808: No data in logstash since 2016-01-30T06:55:37.838Z; investigating
15:33 hashar: Image ci-jessie-wikimedia-1454339883 in wmflabs-eqiad is ready
15:01 hashar: Refreshing Nodepool image. Might have npm/grunt properly set up
03:15 legoktm: deploying https://gerrit.wikimedia.org/r/267630

2016-01-31

13:35 hashar: Jenkins IRC bot started failing at Jan 30 01:04:00 2016 for whatever reason.... Should be fine now
13:33 hashar: cancelling/aborting jobs that are stuck while reporting to IRC (mostly browser tests and beta cluster jobs)
13:32 hashar: Jenkins jobs are being blocked because they can no longer report back to IRC :-(((
13:28 hashar: Jenkins jobs are being blocked because they can no longer report back to IRC :-(((

2016-01-30

12:46 hashar: integration-slave-jessie-1001 : fixed puppet.conf server name and ran puppet

2016-01-29

18:43 thcipriani: updated scap on beta
16:44 thcipriani: deployed scap updates on beta
11:58 _joe_: upgraded hhvm to 3.6 wm8 in deployment-prep

2016-01-28

23:22 MaxSem: Updated portals on betalabs to master
22:23 hashar: salt '*slave-precise*' cmd.run 'apt-get install php5-ldap' ( https://phabricator.wikimedia.org/T124613 ) will need to be puppetized
18:17 thcipriani: cleaning npm cache on slave machines: salt -v '*slave*' cmd.run 'sudo -i -u jenkins-deploy -- npm cache clean'
18:12 thcipriani: running npm cache clean on integration-slave-precise-1011 sudo -i -u jenkins-deploy -- npm cache clean
15:25 hashar: apt-get upgrade deployment-sca01 and deployment-sca02
15:09 hashar: fixing puppet.conf hostname on deployment-upload deployment-conftool deployment-tmh01 deployment-zookeeper01 and deployment-urldownloader
15:06 hashar: fixing puppet.conf hostname on deployment-upload.deployment-prep.eqiad.wmflabs and running puppet
15:00 hashar: Running puppet on deployment-memc02 and deployment-elastic07 . It is catching up with lot of changes
14:59 hashar: fixing puppet hostnames on deployment-elastic07
14:59 hashar: fixing puppet hostnames on deployment-memc02
14:55 hashar: Deleted salt keys deployment-pdf01.eqiad.wmflabs and deployment-memc04.eqiad.wmflabs (obsolete, entries with '.deployment-prep.' are already there)
07:38 jzerebecki: reload zuul for 4951444..43a030b
05:55 jzerebecki: doing https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update
03:49 mobrovac: deployment-prep re-enabled puppet on deployment-restbase0x
02:49 mobrovac: deployment-prep deployment-restbase01 disabled puppet to set up cassandra for
02:27 mobrovac: deployment-prep recreating deployment-restbase01 for T125003
02:23 mobrovac: deployment-prep deployment-restbase02 disabled puppet to recreate deployment-restbase01 for T125003
01:42 mobrovac: deployment-prep recreating deployment-sca02 for T125003
01:28 mobrovac: deployment-prep recreating deployment-sca01 for T125003
00:36 mobrovac: deployment-prep re-imaging deployment-mathoid for T125003
00:02 jzerebecki: integration-slave-trusty-1016:~$ sudo -i rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm/src/skins/Donate

2016-01-27

23:49 jzerebecki: integration-slave-precise-1011:~$ sudo -i /etc/init.d/salt-minion restart
23:46 jzerebecki: work around https://phabricator.wikimedia.org/T117710 : salt --show-timeout '*slave*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm/src/skins/BlueSky'
21:19 cscott: updated OCG to version 64050af0456a43344b32e3e93561a79207565eaf (should be no-op after yesterday's deploy)
10:29 hashar: triggered bunch of browser tests, deployment-redis01 was dead/faulty
10:08 hashar: mass restarting redis-server process on deployment-redis01 (for https://phabricator.wikimedia.org/T124677 )
10:07 hashar: mass restarting redis-server process on deployment-redis01
09:00 hashar: beta: commenting out "latency-monitor-threshold 100" parameter from any /etc/redis/redis.conf we have ( https://phabricator.wikimedia.org/T124677 ). Puppet will not reapply it unless distribution is Jessie

2016-01-26

16:51 cscott: updated OCG to version 64050af0456a43344b32e3e93561a79207565eaf
12:14 hashar: Added Jenkins IRC bot (wmf-insecte) to #wikimedia-perf for https://gerrit.wikimedia.org/r/#/c/265631/
09:30 hashar: restarting Jenkins to upgrade the gearman plugin with https://review.openstack.org/#/c/271543/
04:18 bd808: integration-slave-jessie-1001:/mnt full; cleaned up 15G of files in /mnt/pbuilder/build (27 hours after the last time I did that)

2016-01-25

18:59 twentyafterfour: started redis-server on deployment-redis01 by commenting out latency-monitor-threshold from the redis.conf
15:22 hashar: CI: fixing kernels not upgrading via: rm /boot/grub/menu.lst ; update-grub -y (i.e.: regenerate the Grub menu from scratch)
14:21 hashar: integration-slave-trusty-1015.integration.eqiad.wmflabs is gone. I have failed the kernel upgrade / grub update
01:35 bd808: integration-slave-jessie-1001:/mnt full; cleaned up 15G of files in /mnt/pbuilder/build

2016-01-24

06:45 legoktm: deploying https://gerrit.wikimedia.org/r/266039
06:13 legoktm: deploying https://gerrit.wikimedia.org/r/266041

2016-01-22

23:58 legoktm: removed skins from mwext-qunit workspace on trusty-1013 slave
23:34 legoktm: rm -rf /mnt/jenkins-workspace/workspace/mediawiki-phpunit-php53 on slave precise 1012
22:45 legoktm: deploying https://gerrit.wikimedia.org/r/265864
22:27 hashar: rebooted all CI slaves using OpenStackManager
22:09 hashar: rebooting deployment-redis01 (kernel upgrade)
21:22 hashar: Image ci-jessie-wikimedia-1453497269 in wmflabs-eqiad is ready (with node 4.2 for https://phabricator.wikimedia.org/T119143 )
21:14 hashar: updating nodepool snapshot based on new image
21:12 hashar: rebuilding nodepool reference image
20:04 hashar: Image ci-jessie-wikimedia-1453492820 in wmflabs-eqiad is ready
20:00 hashar: Refreshing nodepool image to hopefully get Nodejs 4.2.4 https://phabricator.wikimedia.org/T124447 https://gerrit.wikimedia.org/r/#/c/265802/
16:32 hashar: Nuked corrupted git repo on integration-slave-precise-1012 /mnt/jenkins-workspace/workspace/mediawiki-extensions-php53
12:23 hashar: beta: reinitialized keyholder on deployment-bastion. The proxy apparently had no identity
09:32 hashar: beta cluster Jenkins jobs have been stalled for 9 hours and 25 minutes. Disabling/reenabling the Gearman plugin to remove the deadlock

2016-01-21

21:41 hashar: restored role::mail::mx on deployment-mx
21:36 hashar: dropping role::mail::mx from deployment-mx to let puppet run
21:33 hashar: rebooting deployment-jobrunner01 / kernel upgrade / /tmp is only 1MBytes
21:19 hashar: fixing up deployment-jobrunner01 /tmp and / disks are full
19:57 thcipriani: ran REPAIR TABLE globalnames; on centralauth db
19:48 legoktm: deploying https://gerrit.wikimedia.org/r/265552
19:39 legoktm: deploying jjb changes for https://gerrit.wikimedia.org/r/264990
19:25 legoktm: deploying https://gerrit.wikimedia.org/r/265546
01:59 jzerebecki: jenkins-deploy@deployment-bastion:/srv/mediawiki-staging/php-master/extensions/SpellingDictionary$ rm -r modules/jquery.uls && git rm modules/jquery.uls
01:00 jzerebecki: jenkins-deploy@deployment-bastion:/srv/mediawiki-staging/php-master/extensions$ git pull && git submodule update --init --recursive
00:57 jzerebecki: jenkins-deploy@deployment-bastion:/srv/mediawiki-staging/php-master/extensions$ git reset HEAD SpellingDictionary

2016-01-20

20:05 hashar: beta sudo find /data/project/upload7/math -type f -delete (probably some old left over)
19:50 hashar: beta: on commons ran deleteArchivedFile.php : Nuked 7130 files
19:49 hashar: beta : foreachwiki deleteArchivedRevisions.php -delete
19:26 hasharAway: Nuked all files from http://commons.wikimedia.beta.wmflabs.org/wiki/Category:GWToolset_Batch_Upload
19:19 hasharAway: beta: sudo find /data/project/upload7/*/*/temp -type f -delete
19:14 hasharAway: beta: sudo rm /data/project/upload7/*/*/lockdir/*
18:57 hasharAway: beta cluster code has been stalled for roughly 2h30
18:55 hasharAway: disconnecting Gearman plugin to remove deadlock for beta cluster jobs
17:06 hashar: clearing files from beta-cluster to prepare for Swift migration. python pwb.py delete.py -family:betacommons -lang:en -cat:'GWToolset Batch Upload' -verbose -putthrottle:0 -summary:'Clearing out old batched upload to save up disk space for Swift migration'

2016-01-19

22:25 legoktm: deleting *zend* workspaces on precise slaves
21:58 thcipriani: trying https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update again
21:57 thcipriani: beta-scap-eqiad still can't find executor on deployment-bastion.eqiad
21:52 thcipriani: following steps at https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update for deployment-bastion
19:34 legoktm: deleting all *zend* jobs from jenkins
09:40 hashar: Created github repo https://github.com/wikimedia/operations-debs-varnish4
03:59 legoktm: deploying https://gerrit.wikimedia.org/r/264912 and https://gerrit.wikimedia.org/r/264922

2016-01-17

18:02 legoktm: deploying https://gerrit.wikimedia.org/r/264605

2016-01-16

21:47 legoktm: deploying https://gerrit.wikimedia.org/r/264489
21:36 legoktm: deploying https://gerrit.wikimedia.org/r/264488
21:29 legoktm: deploying https://gerrit.wikimedia.org/r/264487
21:21 legoktm: deploying https://gerrit.wikimedia.org/r/264483 https://gerrit.wikimedia.org/r/264485
20:58 legoktm: deploying https://gerrit.wikimedia.org/r/264492
18:55 jzerebecki: reloading zuul for 996c558..5f8eb50
09:12 legoktm: deploying https://gerrit.wikimedia.org/r/264448
09:01 legoktm: deploying https://gerrit.wikimedia.org/r/264446 and https://gerrit.wikimedia.org/r/264447
07:46 legoktm: sudo -u jenkins-deploy mv /mnt/jenkins-workspace/workspace/mediawiki-core-phplint /mnt/jenkins-workspace/workspace/mediawiki-core-php53lint on all precise slaves
07:17 legoktm: deploying https://gerrit.wikimedia.org/r/264444
06:31 legoktm: deploying https://gerrit.wikimedia.org/r/264441
06:10 legoktm: added phpflavor-php53 label to all phpflavor-zend slaves

2016-01-15

12:17 hashar: restarting Jenkins for plugins updates
02:49 bd808: Trying to fix submodules in deployment-bastion:/srv/mediawiki-staging/php-master/extensions for T123701

2016-01-14

20:06 legoktm: deploying https://gerrit.wikimedia.org/r/264122
19:32 legoktm: deploying https://gerrit.wikimedia.org/r/264114
19:18 legoktm: deploying https://gerrit.wikimedia.org/r/264108

2016-01-13

21:06 hashar: beta cluster code is up to date again. Got delayed by roughly 4 hours.
20:55 hashar: unlocked Jenkins jobs for beta cluster by disabling/reenabling Jenkins Gearman client
10:15 hashar: beta: fixed puppet on deployment-elastic06 . Was still using cert/hostname without .deployment-prep. .... Mass update occurring.

2016-01-12

23:30 legoktm: deploying https://gerrit.wikimedia.org/r/263757 https://gerrit.wikimedia.org/r/263756
13:32 hashar: beta cluster: running /usr/local/sbin/cleanup-pam-config
13:29 hashar: integration running /usr/local/sbin/cleanup-pam-config on slaves

2016-01-11

22:24 hashar: Deleting old references on Zuul-merger for mediawiki/core : /usr/share/python/zuul/bin/python /home/hashar/zuul-clear-refs.py --until 15 /srv/ssd/zuul/git/mediawiki/core
22:21 hashar: gallium in /srv/ssd/zuul/git/mediawiki/core$ git gc --prune=all && git remote update --prune
22:21 hashar: scandium in /srv/ssd/zuul/git/mediawiki/core$ git gc --prune=all && git remote update --prune
07:35 legoktm: deploying https://gerrit.wikimedia.org/r/263319

2016-01-07

23:16 legoktm: deleted /mnt/jenkins-workspace/workspace/mediawiki-extensions-qunit/src/extensions/PdfHandler/.git/refs/heads/wmf/1.26wmf16.lock on slave 1013
06:32 legoktm: deploying https://gerrit.wikimedia.org/r/262868
02:24 legoktm: deploying https://gerrit.wikimedia.org/r/262855
01:25 jzerebecki: reloading zuul for b0a5335..c16368a

2016-01-06

21:13 thcipriani: kicking integration puppetmaster, weird node unable to find definition.
21:11 jzerebecki: on scandium: sudo -u zuul rm -rf /srv/ssd/zuul/git/mediawiki/services/mathoid
21:04 legoktm: ^ on gallium
21:04 legoktm: manually deleted /srv/ssd/zuul/git/mediawiki/services/mathoid to force zuul to re-clone it
20:17 hashar: beta: dropped a few more /etc/apt/apt.conf.d/*-proxy files. webproxy is no longer reachable from labs
09:44 hashar: CI/beta: deleting all git tags from /var/lib/git/operations/puppet and doing git repack
09:39 hashar: restoring puppet hacks on beta cluster puppetmaster.
09:35 hashar: beta/CI: salt -v '*' cmd.run 'rm -v /etc/apt/apt.conf.d/*-proxy' https://phabricator.wikimedia.org/T122953

2016-01-05

16:54 hashar_: Removed elastic search from CI slaves https://phabricator.wikimedia.org/T89083 https://gerrit.wikimedia.org/r/#/c/259301/
03:45 Krinkle: integration-slave-trusty-1015: rm -rf /mnt/home/jenkins-deploy/.npm per https://integration.wikimedia.org/ci/job/mediawiki-core-qunit/56577/console

2016-01-04

21:06 hashar: gallium has puppet enabled again
20:53 hashar: stopping puppet on gallium and live hacking Zuul configuration for https://phabricator.wikimedia.org/T122656

2016-01-02

03:17 yurik: purged varnishs on deployment-cache-text04

2016-01-01

22:17 bd808: No nodepool ci-jessie-* hosts seen in Jenkins interface and rake-jessie jobs backing up