Wikimedia Cloud Services team/EnhancementProposals/2020 Network refresh/2021-02-03-checkin
2021-02-03 WMCS network checkin
agenda
- Q3 goals, how are they going
  - https://wikitech.wikimedia.org/wiki/News/CloudVPS_NAT_wikis
    - this one, initially approached as potential low-hanging fruit, is proving to be way more challenging and will need to be delayed
    - see all subtasks of https://phabricator.wikimedia.org/T209011
  - https://phabricator.wikimedia.org/T272397 cloud: drop NAT exception for dumps NFS
    - might continue with this one instead, should be easier? (a sketch of what such an exception looks like follows after this agenda)
- procurement of hardware for the new edge network setup:
  - renaming labtestvirt2003 to cloudgw https://phabricator.wikimedia.org/T271519 (done)
  - codfw cloudgw device procurement https://phabricator.wikimedia.org/T268016 (done)
  - codfw cloudsw https://phabricator.wikimedia.org/T272348 (needs discussion)
  - eqiad cloudgw devices procurement https://phabricator.wikimedia.org/T270705 (ordered)
- Production Cloud services relationship review
- wiki replicas
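For context on the T272397 item above: on a Linux gateway, a NAT exception is typically a rule that bypasses the general source NAT for one destination, so "dropping" it means removing that rule. A minimal iptables sketch of the pattern; the addresses are illustrative, not the actual cloudgw configuration:

    # general egress NAT for Cloud VPS instances (illustrative addresses)
    iptables -t nat -A POSTROUTING -s 172.16.0.0/21 -j SNAT --to-source 185.15.56.1

    # the exception: traffic to the dumps NFS server (hypothetical IP) matches
    # first and skips NAT, keeping its internal source address; dropping the
    # exception means deleting this rule so dumps traffic is NATed like the rest
    iptables -t nat -I POSTROUTING -s 172.16.0.0/21 -d 10.64.37.20 -j ACCEPT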
notes
CloudVPS NAT
- CloudVPS NAT wiki changes: several moving parts
- faidon: how can we help
- arturo: we need some help on the communications side, but Joaquin doesn't have time this Q
- faidon: try talking to each team's managers for coordination
- nicholas: timeline needs to be extended
- faidon: yes, ACK complexity
- arzhel: what about introducing a window: perform the change for 1h, see what happens, and collect intel before committing to a final date
- faidon: ideally we don't need 5 teams' green light, that sounds like too much. Faidon can handle part of the internal comms within the SRE sub-teams
- faidon: what about dropping the exceptions progressively rather than all at the same time
- bstorm: bot accounts store IP addresses, how do we handle that?
- arturo: we could drop requests per DC
- faidon: All traffic should be running through eqiad
- brandon: this is a large fraction of traffic coming from a single IP address. Our services are designed for a different case.
- faidon: let's try to break down the problem into smaller pieces
- brandon: if we were talking about 8 or 16 different source IP addresses, then things would be different
- nicholas: there are risks and concerns surrounding this whole project, perhaps we can capture them in a blocker task
- How to do NAT pooling? (a rough sketch follows at the end of this section)
- faidon: can we patch neutron?
- arturo: we are moving away from patching
- arzhel: ipv6 would help here
- faidon: want to avoid tying this work to ipv6
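One way to do the NAT pooling discussed above without patching neutron would be a source-NAT rule on the gateway that maps to a range of public addresses rather than a single one. A minimal nftables sketch, with illustrative address ranges that are not the actual cloudgw config:

    # hypothetical: spread Cloud VPS egress over 16 public source addresses
    # instead of one, addressing the single-source-IP concern raised above
    nft add table ip cloudnat
    nft add chain ip cloudnat postrouting '{ type nat hook postrouting priority 100 ; }'
    nft add rule ip cloudnat postrouting ip saddr 172.16.0.0/21 \
        snat to 185.15.56.1-185.15.56.16 persistent

The "persistent" flag keeps a given internal address mapped to the same public IP across connections, which matters for the point above about tools that store client IP addresses.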
wiki replicas
- brandon: Are we trying to get rid of cloud VLANs, or ...?
- bstorm: the labs VLAN is trying to go away. However, the wiki replicas design was intended to reuse the existing network design, so they inherited it
- brandon: What other services will be behind LVS? Are there more VLANs coming?
- arturo: Understand LVS to be part of the solution for handling "public" traffic.
- faidon: Why do wiki replicas today need to be in the cloud VLAN?
- bstorm: no technical reason. Legacy, presumption?
- faidon: access by anything besides NAT'd network?
- bstorm: dbproxy1018/19 are still accessed the legacy way. Would need to be changed first. New replica ports are out on LVS, but nothing else.
- brandon: for things moving forward to go through LVS, can things like dbproxy live in production VLANs, or do they need to stay in labs VLANs?
- bstorm: should be possible to change. Account creation is done inside the production realm. No LVS required.
- bstorm: Dumps NFS might be a possible service to move to LVS. Don't need write locks, so maybe?
- arturo: Expectation is that wiki replicas are an exception, and future services will do something else
- faidon: Should plan for an LVS future. Understand the migration and timelines (a rough sketch of the LVS model follows at the end of these notes)
- nicholas: once the old cluster is gone, what's blocking?
- faidon: the old cluster is accessed by cloud private addresses. The new cluster doesn't need to be. But the new proxies live in the cloud-support VLAN, which has implications for LVS.
- faidon: if it's being used by cloud private IPs, don't renumber. Remove the use case first, and then renumber to solve it
- arturo: very small machines, easily fixed
- nicholas: perhaps by the end of the FY we can get rid of the old cluster
- faidon: if you end up thinking that procuring a couple new proxy servers would make things easier, then go for it
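For reference, the LVS model discussed above fronts the replica proxies with a virtual service IP. A rough ipvsadm sketch of the idea; all addresses are made up, and the real setup is driven by PyBal/Puppet rather than hand-run commands:

    # hypothetical virtual service for a wiki replicas port, weighted round-robin
    ipvsadm -A -t 208.80.154.100:3306 -s wrr

    # real servers behind it: the dbproxy hosts, using direct routing (-g)
    ipvsadm -a -t 208.80.154.100:3306 -r 10.64.0.11:3306 -g -w 10
    ipvsadm -a -t 208.80.154.100:3306 -r 10.64.0.12:3306 -g -w 10

With direct routing the real servers answer clients directly, which is part of why the VLAN the proxies live in matters for LVS, per the note above about the cloud-support VLAN.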