Wikimedia Cloud Services team/EnhancementProposals/2020 Network refresh/2021-02-03-checkin
2021-02-03 WMCS network checkin
agenda
- Q3 goals, how are they going
  - https://wikitech.wikimedia.org/wiki/News/CloudVPS_NAT_wikis
    - this one, initially approached as potential low-hanging fruit, is proving to be way more challenging and will need to be delayed
    - see all subtasks of https://phabricator.wikimedia.org/T209011
  - https://phabricator.wikimedia.org/T272397 cloud: drop NAT exception for dumps NFS
    - might continue with this one instead, should be easier? (a sketch of what such an exception looks like follows after this agenda)
- procurement of hardware for the new edge network setup:
  - renaming labtestvirt2003 to cloudgw https://phabricator.wikimedia.org/T271519 (done)
  - codfw cloudgw device procurement https://phabricator.wikimedia.org/T268016 (done)
  - codfw cloudsw https://phabricator.wikimedia.org/T272348 (needs discussion)
  - eqiad cloudgw devices procurement https://phabricator.wikimedia.org/T270705 (ordered)
- Production Cloud services relationship review
- wiki replicas
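For context on the T272397 item above: on a Linux gateway, a NAT exception is typically a rule that bypasses the general source NAT for one destination, so "dropping" it means removing that rule. A minimal iptables sketch of the pattern; the addresses are illustrative, not the actual cloudgw configuration:

    # general egress NAT for Cloud VPS instances (illustrative addresses)
    iptables -t nat -A POSTROUTING -s 172.16.0.0/21 -j SNAT --to-source 185.15.56.1

    # the exception: traffic to the dumps NFS server (hypothetical IP) matches
    # first and skips NAT, keeping its internal source address; dropping the
    # exception means deleting this rule so dumps traffic is NATed like the rest
    iptables -t nat -I POSTROUTING -s 172.16.0.0/21 -d 10.64.37.20 -j ACCEPT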
notes
CloudVPS NAT
- CloudVPS NAT wiki changes: several moving parts
- faidon: how can we help
- arturo: we need some help on the communications side, but Joaquin doesn't have time this Q
- faidon: try talking to each team's managers for coordination
- nicholas: timeline needs to be extended
- faidon: yes, ACK complexity
- arzhel: what about introducing a window: perform the change for 1h, see what happens, and collect intel before committing to a final date
- faidon: ideally we don't need 5 teams' green light, that sounds like too much. Faidon can handle part of the internal comms within the SRE sub-teams
- faidon: what about dropping the exceptions progressively rather than all at the same time
- bstorm: bot accounts store IP addresses, how do we handle that?
- arturo: we could drop requests per DC
- faidon: All traffic should be running through eqiad
- brandon: this is a large fraction of traffic coming from a single IP address. Our services are designed for a different case.
- faidon: let's try to break down the problem into smaller pieces
- brandon: if we were talking about 8 or 16 different source IP addresses, then things would be different
- nicholas: there are risks and concerns surrounding this whole project, perhaps we can capture them in a blocker task
- How to do NAT pooling? (a rough sketch follows at the end of this section)
- faidon: can we patch neutron?
- arturo: we are moving away from patching
- arzhel: ipv6 would help here
- faidon: want to avoid tying this work to ipv6
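One way to do the NAT pooling discussed above without patching neutron would be a source-NAT rule on the gateway that maps to a range of public addresses rather than a single one. A minimal nftables sketch, with illustrative address ranges that are not the actual cloudgw config:

    # hypothetical: spread Cloud VPS egress over 16 public source addresses
    # instead of one, addressing the single-source-IP concern raised above
    nft add table ip cloudnat
    nft add chain ip cloudnat postrouting '{ type nat hook postrouting priority 100 ; }'
    nft add rule ip cloudnat postrouting ip saddr 172.16.0.0/21 \
        snat to 185.15.56.1-185.15.56.16 persistent

The "persistent" flag keeps a given internal address mapped to the same public IP across connections, which matters for the point above about tools that store client IP addresses.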
wiki replicas
- brandon: Are we trying to get rid of cloud VLANs, or ...?
- bstorm: the labs VLAN is trying to go away. However, the wiki replicas design was intended to reuse the existing network design, so they inherited it
- brandon: What other services will be behind LVS? Are there more VLANs coming?
- arturo: Understand LVS to be part of the solution for handling "public" traffic.
- faidon: Why do wiki replicas today need to be in the cloud VLAN?
- bstorm: no technical reason. Legacy, presumption?
- faidon: access by anything besides NAT'd network?
- bstorm: dbproxy1018/19 are still accessed the legacy way. Would need to be changed first. New replica ports are out on LVS, but nothing else.
- brandon: for things moving forward to go through LVS, can things like dbproxy live in production VLANs, or do they need to stay in labs VLANs?
- bstorm: should be possible to change. Account creation is done inside the production realm. No LVS required.
- bstorm: Dumps NFS might be a possible service to move to LVS. Don't need write locks, so maybe?
- arturo: Expectation is that wiki replicas are an exception, and future services will do something else
- faidon: Should plan for an LVS future. Understand the migration and timelines (a rough sketch of the LVS model follows at the end of these notes)
- nicholas: once the old cluster is gone, what's blocking?
- faidon: the old cluster is accessed by cloud private addresses. The new cluster doesn't need to be. But the new proxies live in the cloud-support VLAN, which has implications for LVS.
- faidon: if it's being used by cloud private IPs, don't renumber. Remove the use case first, and then renumber to solve it
- arturo: very small machines, easily fixed
- nicholas: perhaps by the end of the FY we can get rid of the old cluster
- faidon: if you end up thinking that procuring a couple new proxy servers would make things easier, then go for it
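For reference, the LVS model discussed above fronts the replica proxies with a virtual service IP. A rough ipvsadm sketch of the idea; all addresses are made up, and the real setup is driven by PyBal/Puppet rather than hand-run commands:

    # hypothetical virtual service for a wiki replicas port, weighted round-robin
    ipvsadm -A -t 208.80.154.100:3306 -s wrr

    # real servers behind it: the dbproxy hosts, using direct routing (-g)
    ipvsadm -a -t 208.80.154.100:3306 -r 10.64.0.11:3306 -g -w 10
    ipvsadm -a -t 208.80.154.100:3306 -r 10.64.0.12:3306 -g -w 10

With direct routing the real servers answer clients directly, which is part of why the VLAN the proxies live in matters for LVS, per the note above about the cloud-support VLAN.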