Swift/Deploy Plan - Originals Part 2
Last steps for the move from ms7 (media server for originals) to swift for reads.
What we need to do to switch upload/foo originals to read from swift:
Blacklist approach
1. block authenticated requests to swift at squid:
- I hate blacklists instead of whitelists, but this will probably do the job:
- reject swift authenticated requests. probably should do some header checking too...
- acl swift_auth urlpath_regex ^/(auth|v[^/]+/AUTH).*
- http_access deny swift_auth
- might be better to use the same pattern as in rewrite.py (that specifically catches {32,36} hex chars):
    # If it already has AUTH, presume that it's good. #07. fixes bug 33620
    hasauth = re.search('/AUTH_[0-9a-fA-F-]{32,36}', req.path)
    if req.path.startswith('/auth') or hasauth:
        return self.app(env, start_response)
- why? there's no way ^/v[^/]+/AUTH.* is going to match any files. I don't like encoding the same logic over and over across config and systems. What if we change the length of the token at some point? will we remember to change squid.conf too?
2. Auth URLs are not enough to protect swift. We should probably filter out X-Authenticate headers as well as non-GET/HEAD methods.
- Can't Swift protect itself based on IP ranges?
3. check swift ACLs so that each private wiki's "public" container is not world readable
- eg office wiki is set correctly
- what's the full list of "private" wikis?
- method of testing:
- swift -A http://127.0.0.1/auth/v1.0 -U mw:thumb -K $pass stat wikipedia-office-local-public
- shows blank Read ACL: instead of mw:thumb,.r:*
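The check in step 3 could be scripted across the list of private wikis. A minimal sketch, assuming the `swift ... stat` output contains a line starting with `Read ACL:` as quoted above; the helper name `container_is_world_readable` and the sample output strings are made up for illustration.

```python
def container_is_world_readable(stat_output):
    """Return True if the 'Read ACL:' line of `swift stat` output grants .r:*."""
    for line in stat_output.splitlines():
        # Split only on the first colon so values like "mw:thumb" stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Read ACL":
            acl_entries = [entry.strip() for entry in value.split(",")]
            return ".r:*" in acl_entries
    # No Read ACL line at all: treat as not world readable.
    return False


# Hypothetical stat output fragments; only the Read ACL line matters here.
public_wiki = "  Read ACL: mw:thumb,.r:*\n"
private_wiki = "  Read ACL:\n"

print(container_is_world_readable(public_wiki))
print(container_is_world_readable(private_wiki))
```

A private wiki's "public" container should make this return False, matching the blank Read ACL seen for the office wiki.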
4. upload URLs to swift:
- normal. url pattern: http://upload.wikimedia.org/wikipedia/commons/a/a2/Little_kitten_.jpg
- exception: math needs to stay on NFS
- What is the status of swift awareness of the math code in MW anyway?
- two patterns:
- http://upload.wikimedia.org/math/d/b/e/dbe26cb91f8356077ad0a05f5f38ed9b.png
- /math/3/8/a/38ac6c8c311d23e55328cca451eeed23.png (not sure about this one)
- -> so use urlpath_regex instead of url_regex? (I think it's still upload; I'm not sure why tcpdump didn't give me the hostname. The interesting part is that the URL is anchored at math instead of project/language/math.)
- proxy vs non-proxy requests? yes, but who doesn't use the proxy-style url when asking an image scaler a question? mediawiki?
- exception: graphs and timeline extensions
- all of these caught by:
+# math extension still requires NFS. send these to ms7 until we can fix that.
+acl ms7_math urlpath_regex ^/[^/]+/[^/]+/(graphs|math|timeline)/.*
+cache_peer_access 10.0.0.246 allow ms7_math
+cache_peer_access 10.0.0.246 deny all
5. Other stuff that's on ms7 that swift doesn't (yet?) have (this list comes from ls /mnt/upload6/):
See Cruft on ms7 for this list with annotations, and make changes there.
- dirs
- math (as mentioned above)
- ext-dist
- jars
- portal
- private
- skins
- scripts
- sync-from-home (aka scripts) (this can go)
- lost-image-thumb-backup
- files
- pybaltestfile.txt !!! <-- swift does have this at monitoring/pybal.txt; see lvs.pp for full URL but rewrite.py will probably have to be modified to serve this file directly
- robots.txt
- index.html
- favicon.ico
- x1
- mime.php
overall approach:
- set swift as a default target and blacklist things that shouldn't get there?
- must blacklist auth, things that haven't moved to swift yet
- whitelist each thing we transition from ms7 to swift leaving the default on ms7?
- then watch ms7's traffic and see when stuff stops coming in
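The blacklist variant can be sketched as a small routing function: deny auth-style paths, send the extensions that still need NFS to ms7, and let everything else default to swift. This is only an illustration of the decision order in Python. The real logic would live in squid.conf, and the `route` function, its return labels, and the `/wikipedia/en/math/...` sample path are hypothetical. With the patterns as written in the notes, the math URL form anchored at `/math/` falls through to the default.

```python
import re

# Patterns copied from the notes (squid urlpath_regex style).
SWIFT_AUTH = re.compile(r"^/(auth|v[^/]+/AUTH).*")
MS7_EXCEPTIONS = re.compile(r"^/[^/]+/[^/]+/(graphs|math|timeline)/.*")


def route(path):
    """Return 'deny', 'ms7' or 'swift' for a URL path (hypothetical labels)."""
    if SWIFT_AUTH.search(path):
        return "deny"   # authenticated swift requests never reach a backend
    if MS7_EXCEPTIONS.search(path):
        return "ms7"    # extension output that still lives on NFS
    return "swift"      # default target in the blacklist approach


for path in [
    "/v1/AUTH_XXXX/monitoring/pybaltestfile.txt",
    "/wikipedia/en/math/d/b/e/dbe26cb91f8356077ad0a05f5f38ed9b.png",
    "/math/d/b/e/dbe26cb91f8356077ad0a05f5f38ed9b.png",
    "/wikipedia/commons/a/a2/Little_kitten_.jpg",
]:
    print(route(path), path)
```

If the root-anchored `/math/` form really does occur on upload, the exception pattern would need a second alternative for it before swift becomes the default.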
Whitelist (or combination) approach:
still blacklist auth attempts, just cuz we can
- acl swift_auth1 urlpath_regex ^/auth
- acl swift_auth2 urlpath_regex v[0-9]+/AUTH_.*
- acl swift_auth3 urlpath_regex AUTH_[0-9a-fA-F-]{32,36}
- needs testing, does {digits} work?
- http_access deny swift_auth1
- http_access deny swift_auth2
- http_access deny swift_auth3
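One way to sanity-check the three patterns before touching squid is to run them against sample paths. The sketch below uses Python's `re` module, which is what rewrite.py uses. Squid compiles its ACLs with its own regex library, so this checks only the intent of the patterns and does not settle whether `{32,36}` works in squid. The `is_swift_auth` helper and the sample account id are made up for illustration.

```python
import re

# The three ACL patterns from the notes. Only the first is anchored;
# the others are searched anywhere in the URL path.
SWIFT_AUTH_PATTERNS = [
    re.compile(r"^/auth"),
    re.compile(r"v[0-9]+/AUTH_.*"),
    re.compile(r"AUTH_[0-9a-fA-F-]{32,36}"),
]


def is_swift_auth(path):
    """Return True if any of the three patterns matches the path."""
    return any(p.search(path) for p in SWIFT_AUTH_PATTERNS)


# Hypothetical sample paths; the account id is a made-up UUID.
samples = [
    "/auth/v1.0",
    "/v1/AUTH_43651b15-ed7a-40b6-b745-47666abf8dfe/monitoring/pybaltestfile.txt",
    "/wikipedia/commons/a/a2/Little_kitten_.jpg",
    "/math/d/b/e/dbe26cb91f8356077ad0a05f5f38ed9b.png",
]
for path in samples:
    print(is_swift_auth(path), path)
```

Note that `^/auth` is a prefix match, so it would also deny any other path that happens to begin with `/auth`.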
existing squid acl for thumbs:
- acl swift_thumbs url_regex ^http://upload\.wikimedia\.org(/+)[^/][^/]*/[^/][^/]*/thumb/
- original URL patterns:
- http://upload.wikimedia.org/wikipedia/commons/8/84/W_logo_for_Mobile_Frontend.gif
squid acl for originals: (Aaron's suggestion)
- acl swift_origs url_regex ^http://upload\.wikimedia\.org/(wikibooks|wikinews|wikiquote|wikiversity|wikimedia|wikipedia|wikisource|wiktionary)/[^/]+/(archive/)?[0-9a-f]/[0-9a-f][0-9a-f]/
- acl swift_temps url_regex ^http://upload\.wikimedia\.org/(wikibooks|wikinews|wikiquote|wikiversity|wikimedia|wikipedia|wikisource|wiktionary)/[^/]+/temp/[0-9a-f]/[0-9a-f][0-9a-f]/
- acl swift_thumbs url_regex ^http://upload\.wikimedia\.org/(wikibooks|wikinews|wikiquote|wikiversity|wikimedia|wikipedia|wikisource|wiktionary)/[^/]+/thumb/(archive/|temp/)?[0-9a-f]/[0-9a-f][0-9a-f]/
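The three whitelist patterns can be exercised against known URL shapes before they go into squid.conf. A rough sketch in Python, with the project alternation written using a literal `|` and the dots in the hostname escaped. The `matching_acls` helper is hypothetical, and the thumb and archive sample URLs are invented to fit the documented layout. Python's `re` is not squid's regex engine, so this checks the patterns' logic only.

```python
import re

PROJECTS = ("wikibooks|wikinews|wikiquote|wikiversity|"
            "wikimedia|wikipedia|wikisource|wiktionary")
PREFIX = r"^http://upload\.wikimedia\.org/(" + PROJECTS + r")/[^/]+/"
HASH = r"[0-9a-f]/[0-9a-f][0-9a-f]/"   # e.g. "a/a2/"

ACLS = {
    "swift_origs": re.compile(PREFIX + r"(archive/)?" + HASH),
    "swift_temps": re.compile(PREFIX + r"temp/" + HASH),
    "swift_thumbs": re.compile(PREFIX + r"thumb/(archive/|temp/)?" + HASH),
}


def matching_acls(url):
    """Return the sorted names of all ACLs whose pattern matches the URL."""
    return sorted(name for name, rx in ACLS.items() if rx.search(url))


base = "http://upload.wikimedia.org"
for url in [
    base + "/wikipedia/commons/8/84/W_logo_for_Mobile_Frontend.gif",
    base + "/wikipedia/commons/archive/a/a2/20120101000000!Little_kitten_.jpg",
    base + "/wikipedia/commons/thumb/a/a2/Little_kitten_.jpg/120px-Little_kitten_.jpg",
    base + "/math/d/b/e/dbe26cb91f8356077ad0a05f5f38ed9b.png",
]:
    print(matching_acls(url), url)
```

Each URL should match at most one ACL. A math URL matching none of them is the desired outcome, since in the whitelist approach unmatched requests stay on ms7.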
Prep first:
- test that non-HEAD/GET requests to upload are rejected. we think so based on the front end config, but best to try some requests
- see if swift is really missing a pile of thumbs (how?) -- skipping and hoping
How to test on a single squid
- Want to take it out of front and backend service.
- Out of front end: enabled = false in pybal config
- why frontend? it shouldn't matter. backend is what we need
- indeed, frontend doesn't need depooling
- if the test squid front end is active, it will have itself listed in the backend list I think, since when we deploy the files generated without the test squid in the conf, its own file isn't regenerated.
- Out of back end: ?
- (after irc discussion) remove the squid from the config file, generate, then edit the front end config file for the test squid to remove itself from the back end list. Push this change to *all* (so esams gets the update), and maybe to the specific host manually. Note: if you don't fix it on the puppet master, puppet will overwrite your edited file on the test squid frontend :-/
- revert the removal, make back end config changes for our test, generate
- after generating squid configs, deploy only to that host, 'cache' as type
No lab project we can test squid confs for swift in, right?
Do these later (not during this window)?
squid acl for math (once it lives on swift; requires rewrite.py changes)
- acl swift_math url_regex ^http://upload\.wikimedia\.org(/+)[^/]+/[^/]+/math/
- acl swift_math2 url_regex ^http://upload\.wikimedia\.org(/+)math/
squid acl for ext-dist: (requires rewrite.py changes)
- acl swift_extdist url_regex ^http://upload\.wikimedia\.org(/+)ext-dist/
etc...
Tests (and expected results)
- random /math/: MISS, 200 OK, Sun-Java-System-Web-Server
- random thumb: MISS, 200 OK, swift (X-Object-Meta-Sha1base36)
- random orig: MISS, 200 OK, swift
- random /archive/: MISS, 200 OK, swift
- request using swift syntax to monitoring container/file MISS (403 from squid)
- curl -v -H "User-Agent: benfoo" -H "Host: upload.wikimedia.org" http://sq51.wikimedia.org:3128/v1/AUTH_XXXX/monitoring/pybaltestfile.txt
- OPTIONS on backend: 405 method not allowed (presumably from swift); OPTIONS on frontend 403 forbidden
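The expected results above can be turned into a small checker that guesses which backend answered from the response headers. A minimal sketch, assuming the only markers available are the ones named in the list: a `Server` header containing `Sun-Java-System-Web-Server` for ms7 and an `X-Object-Meta-Sha1base36` header for swift. The notes give no header marker for originals served by swift, so anything else is reported as unknown. The `served_by` name and the sample header values are hypothetical.

```python
def served_by(headers):
    """Guess the backend ('ms7', 'swift' or 'unknown') from response headers."""
    # Header names are case-insensitive, so normalise them first.
    h = {name.lower(): value for name, value in headers.items()}
    if "sun-java-system-web-server" in h.get("server", "").lower():
        return "ms7"
    if "x-object-meta-sha1base36" in h:
        return "swift"
    return "unknown"


# Hypothetical header sets shaped like the expected results in the notes.
math_response = {"Server": "Sun-Java-System-Web-Server"}
thumb_response = {"X-Object-Meta-Sha1base36": "examplehashvalue"}

print(served_by(math_response))
print(served_by(thumb_response))
print(served_by({"Content-Type": "image/png"}))
```

Feeding it the headers printed by `curl -v` for each test case gives a quick pass/fail against the list.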
What happened
Spike in load on the image scalers, much iowait, a small load increase on ms5; the scalers eventually became unresponsive. We later found NFS server timeout messages in the logs on the scalers. After the revert, this situation continued for a while until load on the scalers eventually dropped sharply, at the same time that load on ms5 returned to normal.
During the same interval, we saw some HTTP requests to ms5 for thumbs and images, about 5 GET requests per second. Note that this is nothing compared to the traffic it used to handle, about 40/sec.
However... ms5 is 100% full (there's about 101 GB free). We already know from experience that the more thumbs are in these directories, the slower it gets, and that we will eventually see NFS timeouts. I don't know if we've ever run this close to the edge.