Wikimedia Cloud Services team/EnhancementProposals/2020 Network refresh/2020-09-30-checkin
Agenda:
- PoC Updates
- Proposal Updates/Feedback
Dallas
- Pending vlan changes being worked on
- Almost ready
- Need to test the NATs involved: https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/2020_Network_refresh#/media/File:WMCS_network-NAT.png
Feedback:
- Document is much clearer now :-)
- Clarify the firewalling aspect. Having a perimeter firewall is a new thing for cloudvps
- How could we prioritize security concerns of both SRE and WMCS? Can we close the dmz_cidr loophole this year?
Questions:
- Can the existing technical debt be quantified?
- Code in neutron limits WMCS and prevents upgrades. We maintain our own patches. Applying a patch isn't a ton of work, but we have to verify the patch still works, nothing is broken, etc
- Caused outages? Routing has broken, services broken in the past. Typically we find issues before releasing. Requires hand validation. ~1-2 weeks?
- Inability to have network isolation between tenants is a bigger long-term issue than the dmz patches.
- current network topology is flat, upstream uses tenant networking
- overall, maintenance, improvements, and security concerns with existing patches
- NAT exceptions, dmz_cidr, we want to remove these right?
- Yes, we want to deprecate overall. And the existing exceptions we want to get out of neutron
- option 3 doesn't move floating ip, it moves NAT for pool of tenants between world and prod?
- Yes. Also firewalling. The only firewall today is the core router, so want to move firewall as well.
- What kind of firewalling? What loads?
- Core router policy firewall. Allow contacting supporting services (nfs, wikireplicas). Managing policy has proven difficult. Policies that exist in core routers protect prod from tenants. Stateless firewall today.
- If firewall is protecting production, it can't move outside of production right?
- WMCS can't block unwanted services from crossing
- VM specific policy could be relocated
- until dmz_cidr is gone, core routers must retain firewalling policies (or at an extension of the prod network)
- What's preventing us today from limiting network under dmz_cidr? Limiting subnets, limiting network, etc
- We tried to do this specifically in the past, but it has implications for wiki communities, e.g. rate limits. Unclear how this would impact communities. For NFS, we have to know which server
- time and expertise?
- last time we tried, we had to roll back. Agreed it was a hard task.
- But is a dependency a requirement? Could we work things in a different order?
- Yes, it's possible. Doing this work unblocks other things. Freeing the NAT clarifies how to remove it.
- Is it possible to delay reducing technical debt within WMCS without putting things at risk?
- What timing concerns does SRE have with the current proposal?
- Want to secure infra as much as possible, and do so under Frontline Defenses this year
- Big network security issues from cloud into production for this reason
- Work has been delayed for this larger effort. But the loophole needs to be closed this year
- This NAT has been a longstanding issue, and needs to be resolved
- Can we move forward on Arzhel's goals?
- Yes. Using BGP, stop exposing core router. In
- Can we move forward on option 3?
- Still TBD
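Illustration of the dmz_cidr discussion above: a minimal sketch of how a NAT exception list behaves, assuming dmz_cidr is a list of (source network, destination network) pairs for which source NAT is skipped. The networks, the NAT address, and the function name are hypothetical, chosen only for illustration; they are not the real WMCS configuration or the neutron patch.

```python
import ipaddress

# Hypothetical values for illustration only (assumptions, not real config).
DMZ_CIDR = [
    ("172.16.0.0/21", "10.0.0.0/8"),  # tenant VMs -> internal services
]
NAT_ADDRESS = "198.51.100.1"  # documentation-range address standing in for the NAT IP


def egress_source(src, dst, exceptions=DMZ_CIDR, nat_address=NAT_ADDRESS):
    """Return the source address the destination would see for src -> dst."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    for src_net, dst_net in exceptions:
        if (src_ip in ipaddress.ip_network(src_net)
                and dst_ip in ipaddress.ip_network(dst_net)):
            # Exception matches: NAT is skipped, the VM address is exposed.
            return src
    # No exception matches: traffic is source-NATed.
    return nat_address


if __name__ == "__main__":
    print(egress_source("172.16.1.5", "10.64.0.10"))     # matches the exception
    print(egress_source("172.16.1.5", "93.184.216.34"))  # goes through NAT
    print(egress_source("172.16.1.5", "10.64.0.10", exceptions=[]))
```

In this sketch, emptying the exception list means every flow is NATed, which is the end state the notes describe as closing the dmz_cidr loophole.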