Obsolete:Eqiad Migration Planning
Coordination
- An (incomplete) tracking ticket now exists in RT, depending on more specific tickets.
- Platform Engineering will be using Bug 39106 for tracking dev tasks
- Sept 12 Update - http://etherpad.wikimedia.org/TechOps-12Sept2012
- 3rd Jan Update - http://etherpad.wmflabs.org/pad/p/EqiadMigration-3Jan2013
- 8th Jan Update - http://etherpad.wmflabs.org/pad/p/EqiadMigration-8Jan2013
- Weekly Countdown meeting http://etherpad.wmflabs.org/pad/p/EqiadMigration - meeting minutes
- Checklist and acceptance tests culled from this page [IN PROCESS] /Checklist
Outstanding Server/System Readiness
- Master RT - https://rt.wikimedia.org/Ticket/Display.html?id=3403
- Master Bugzilla - https://bugzilla.wikimedia.org/show_bug.cgi?id=39106
- App, Imagescalers, Bits, Jobrunners and API Apaches
- All Ready - awaiting code deploy
- Parsoid servers@Eqiad
- Target - 1/11/13 (RobH)
- setup pc1001-1003 (PY/Asher)- https://rt.wikimedia.org/Ticket/Display.html?id=3644 / bugzilla http://bugzilla.wikimedia.org/42463
- Deployed 1/14/13
- Setup Ceph in eqiad for image storage (Swift in Tampa & Ceph in EQIAD) (Faidon/Mark)
- 2 more servers set up (up to 4 now), intra-cluster replication ETA is Saturday early morning PST
- holding off on adding more so as not to slow the swift->ceph replication
- swift->ceph copy: 17.5 TB of 43 TB done; completion in ~12 days (very rough estimate)
- some stability issues - close cooperation with Ceph developers, being fixed realtime
- PERC H310 issue - worked around with RAID 0
- 0.56 has been released and deployed to the eqiad cluster
- various other hiccups, both hardware & software related
- still pending: puppetization, rewrite.py -> VCL, testing with MediaWiki
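The rough copy estimate above is plain linear extrapolation; a sketch of the arithmetic (the elapsed time behind the 17.5 TB figure is an assumed value for illustration, not taken from this page):

```python
# Rough ETA for the swift -> ceph bulk copy, using the figures quoted
# above (17.5 TB copied out of 43 TB total). The elapsed_days that
# produced the 17.5 TB is an assumption; plug in the real value.
def copy_eta_days(copied_tb, total_tb, elapsed_days):
    """Linearly extrapolate the remaining copy time, in days."""
    rate = copied_tb / elapsed_days          # TB per day so far
    return (total_tb - copied_tb) / rate

# Example: if the first 17.5 TB took 8 days, the remaining 25.5 TB
# needs about 11.7 more days at the same rate.
print(round(copy_eta_days(17.5, 43.0, 8), 1))  # 11.7
```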
- Remove production MediaWiki dependency on NFS - https://bugzilla.wikimedia.org/show_bug.cgi?id=43495 (RT-4183)
- https://bugzilla.wikimedia.org/37946: Add support for git branches to ExtensionDistributor - Chad
- by end of 1/12/13 - testing
- https://bugzilla.wikimedia.org/43493: Debug and reenable Swift-based CAPTCHA (*) - Aaron
- being tested using Swift-based CAPTCHA
- Database Master switchover (PY / Asher)
- MHA
- https://bugzilla.wikimedia.org/show_bug.cgi?id=43338 - Dev tasks related to git-deploy migration; ready and use it on 1/16/13
- https://bugzilla.wikimedia.org/43339: Deploy git-deploy to the Beta Cluster - Antoine
- https://bugzilla.wikimedia.org/43614 l10n generation in git-deploy - Brad / Ryan
- create localization directory, etc
- Tim to review
- https://bugzilla.wikimedia.org/43340 - Design new on-disk layout for MediaWiki install on tin/eqiad Apaches - Sam/Tim
- https://bugzilla.wikimedia.org/43615 Audit of the salt scripts for completeness (looking in current scripts) - Aaron
- mwscript
- https://bugzilla.wikimedia.org/show_bug.cgi?id=39082 - Add support for deploying per-datacenter config variances - Antoine
- multi-datacenter support - Antoine
- https://bugzilla.wikimedia.org/show_bug.cgi?id=43453 - Checklist/script to switch datacenters - Tim
- Automated DB/Apache switchover script
- Tampa - Read-only
- Eqiad - Grants needed
- See "Actually Failing Over" below.
- varnish configuration switchover script - Mark
Software / Config Requirements
- MediaWiki deploy support for per colo config variances (Bugzilla 39082)
- generating eqiad and pmtpa dsh groups
- mostly done - rolling out by end of month https://gerrit.wikimedia.org/r/#/c/32167/ https://gerrit.wikimedia.org/r/#/c/32168/ ..
- new mediawiki conf files for eqiad
- replicating the git checkouts, etc. to new /home
- not an issue
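The per-colo config variance requirement (Bugzilla 39082) boils down to merging shared settings with per-datacenter overrides. A minimal sketch of the idea follows; the datacenter names are real, but the lookup structure, hostnames, and function name are illustrative, not the actual MediaWiki configuration code (which is PHP):

```python
# Illustrative per-datacenter config selection. MediaWiki's real config
# lives in PHP; this only sketches the variance mechanism.
SITE_CONFIG = {
    # settings shared by all datacenters
    "common": {"wgServer": "https://en.wikipedia.org"},
    # per-colo overrides: each datacenter points at its own local services
    # (hostnames here are hypothetical examples)
    "pmtpa": {"memcached": "mc1.pmtpa.wmnet:11211"},
    "eqiad": {"memcached": "mc1001.eqiad.wmnet:11211"},
}

def config_for(datacenter):
    """Merge the common settings with one datacenter's overrides."""
    merged = dict(SITE_CONFIG["common"])
    merged.update(SITE_CONFIG.get(datacenter, {}))
    return merged

print(config_for("eqiad")["memcached"])  # the eqiad-local memcached
```

The design point is that switching the primary datacenter only changes which override set is merged in, not the shared configuration itself.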
Actually Failing Over
- https://bugzilla.wikimedia.org/show_bug.cgi?id=43453 - Checklist/script to switch datacenters - Tim
- Automated DB/Apache switchover script
- Tampa - Read-only
- Eqiad - Grants needed
- See "Sequence" below.
- Sequence (AI - Asher)
- deploy db.php with all shards set to read-only in both pmtpa and eqiad
- redis failover - setting mc1001-1016 as masters, mc1-16 slaving from eqiad
- deploy squid and mobile + bits varnish configs pointing to eqiad apaches
- start with read-only mode
- try to bypass puppet / must complete within a minute or two
- database warmup - scripting select query collection for every project, and warmup of all eqiad dbs
- master swap every core db and writable es shard to eqiad
- deploy db.php in eqiad removing the read-only flag, leave it read-only in pmtpa
- the above master-swap + db.php deploys can be done shard by shard to limit the time certain projects are read-only
- No DNS or Ceph/Swift changes required
- Rollback plan - details still to be added
- turn off multi-write to NAS & turn on multi-write to Ceph
- TEST! TEST! TEST!
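The sequence above reads as an ordered runbook. The skeleton below sketches it; every step is a stub (the real work is db.php deploys, redis re-mastering, and squid/varnish config pushes), but the ordering and the stop-on-first-failure behaviour are the point:

```python
# Skeleton of the failover sequence above. Each step is a stub standing
# in for the real operation. Ordering matters: everything goes read-only
# first, and writes only resume in eqiad after the DB masters are swapped.
FAILOVER_STEPS = [
    "set all shards read-only in pmtpa and eqiad (db.php)",
    "re-master redis: mc1001-1016 become masters, mc1-16 slave from eqiad",
    "point squid and mobile+bits varnish at the eqiad apaches",
    "warm up eqiad databases with the collected SELECT queries",
    "swap every core db and writable es shard master to eqiad",
    "deploy db.php in eqiad without the read-only flag (pmtpa stays read-only)",
]

def run_failover(execute):
    """Run each step in order; stop at the first failure so rollback can start."""
    done = []
    for step in FAILOVER_STEPS:
        if not execute(step):
            return done, step  # (completed steps, failed step)
        done.append(step)
    return done, None

# Dry run in which every step "succeeds".
done, failed = run_failover(lambda step: True)
assert failed is None and len(done) == len(FAILOVER_STEPS)
```

As noted above, the master-swap and db.php steps can also be iterated shard by shard to shrink each project's read-only window.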
Deployment - D-Day
- Day minus 1 (1/21/13) preparation work
- Automated test run
- determine if deploying bits early is a possibility
- D-Day 1/22/13
- see the failover sequence above
- D-day + 1 1/23/13
Risk & Mitigation
Identify the high risk migration tasks and ensure we have a way to mitigate or revert without extended downtime.
- What could make falling back to Tampa a big problem should the migration fail?
- should Ceph fail?
- should Swift@Tampa fail?
- Database integrity
- Performance
- Need to determine Switchback Threshold - ??
- Test checklist: /Checklist
Improving Switchover
- pre-generate squid + varnish configs for different primary datacenter roles
- implement MHA to better automate the mysql master failovers
- migrate session storage to redis, with redundant replicas across colos
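The first item above (pre-generating squid + varnish configs for each possible primary-datacenter role) amounts to a small templating step, so a switchover swaps in a prebuilt file rather than editing configs under pressure. A sketch, with an illustrative template rather than real squid/varnish syntax:

```python
# Pre-generate one frontend config per candidate primary datacenter.
# The template text is illustrative, not actual squid or VCL syntax;
# the hostname pattern is a hypothetical example.
TEMPLATE = "backend_primary = apaches.{dc}.wmnet\n"
DATACENTERS = ["pmtpa", "eqiad"]

def pregenerate(template, datacenters):
    """Return {datacenter: rendered config} for every primary-DC role."""
    return {dc: template.format(dc=dc) for dc in datacenters}

configs = pregenerate(TEMPLATE, DATACENTERS)
print(configs["eqiad"])  # the config to install when eqiad is primary
```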
See more
- Records and original tracking doc - http://etherpad.wikimedia.org/EQIAD-rollout-sequence
- Category:Eqiad cluster
Parking Lot Issues
- Identify and plan around the deployment/migration date - tentatively Oct 15, 2012 [see below]. Need to communicate the date. Migration needs to happen before the fundraising season starts in November.
- Vacation 'freeze'; all hands on deck week before and after deployment
- migrate ns1 from Tampa to Ashburn; not a critical item
- An update from CT Woo from October 2012 regarding the status of the migration is available here. It looks like it'll be pushed back to January or February 2013 (post-annual fundraiser).
- Hume equivalent (misc::maintenance) - postponed
- https://rt.wikimedia.org/Ticket/Display.html?id=1279 - allocate 1 box in eqiad for puppet testing
- not critical/not showstopper
- create/doc CheckList - PY/ChrisM
- Test checklist: /Checklist
AI - automated test scripts - ChrisM
Use Cases - Tests
- Developer
- Check in/check out code
- code review
- Code push/deploy
- revert deployment
- User
- registers
- search article
- read article
- comment on article
- edit article
- create article
- localization
- Community member
- tag article
- (exercise special pages features)
- Ops
- monitoring works - ganglia, nagios, torrus, .....
- check amanda backups
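The use cases above lend themselves to a small automated smoke-test harness, matching the "automated test scripts" action item. A sketch: the checks here are stubs, and the real versions would exercise the live site (HTTP requests, a test edit, monitoring queries):

```python
# Minimal smoke-test harness for the post-migration acceptance checks
# listed above. Each check is a (name, callable) pair; the callable
# returns True on success. The stubs below stand in for real checks.
def run_smoke_tests(checks):
    """Run every check, collecting failures instead of stopping early."""
    failures = []
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failure
        if not ok:
            failures.append(name)
    return failures

# Stub examples standing in for the user/developer/ops checks above.
CHECKS = [
    ("user: read article", lambda: True),
    ("user: edit article", lambda: True),
    ("ops: ganglia reachable", lambda: True),
]
assert run_smoke_tests(CHECKS) == []  # empty list means all checks passed
```

Collecting all failures (rather than aborting on the first) gives a full picture of what broke after the switchover, which is more useful during a countdown than a single pass/fail.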