Swift/Load Thumbnail Data
This page details the method for loading existing thumbnails into swift before it is deployed.
new method
ms5 is currently under severe load. To avoid adding to that load, we don't want to run 'find' on ms5 (the old method).
capture incoming thumbnail requests
On ms5, we run tcpdump to capture incoming thumbnail requests. We ship them off to fenari, where they're processed and requested from the squids (which have just received the object back from ms5 and can serve it from cache).
root@ms5:~# tcpdump -i any -s 0 dst port 80 and dst host ms5.pmtpa.wmnet -A \
    | grep GET \
    | grep -o "[^ ]*/thumb/[^ ]*" \
    | nc fenari.wikimedia.org 29876
stuff images into swift
On fenari, we run a listener (see the sketch below) that
- processes the incoming list of URLs from ms5's tcpdump
- holds on to each URL for about 30s to make sure it's present in squid
- requests the URL from swift, which falls through to upload.wikimedia.org (the squids) on 404
ben@fenari:~/swift$ ./urllistener-fifo
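The listener script itself isn't reproduced on this page. The following is only a rough sketch of what it does, not the actual urllistener-fifo: it assumes paths arrive one per line on TCP port 29876, straight from ms5's tcpdump|nc pipeline (the real script, per its name, may read from a FIFO instead), and it assumes the msfe-pmtpa-test front end named later on this page.

#!/usr/bin/env python3
# Rough sketch only -- not the real urllistener-fifo.
# Assumes thumbnail paths arrive one per line on TCP port 29876 and that
# the swift front end is the msfe-pmtpa-test host used elsewhere on this page.
import socket
import threading
import time
import urllib.request

SWIFT_FRONTEND = 'http://msfe-pmtpa-test.wikimedia.org:8080'  # assumed front end
HOLD_SECONDS = 30   # give squid time to cache the object ms5 just served
LISTEN_PORT = 29876

def fetch_later(path):
    # Wait, then GET the thumbnail from swift; a 404 in swift falls through
    # to upload.wikimedia.org (the squids), and swift stores the result.
    time.sleep(HOLD_SECONDS)
    url = SWIFT_FRONTEND + path
    try:
        urllib.request.urlopen(url, timeout=30).read()
    except Exception as exc:
        print('failed %s: %s' % (url, exc))

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(('', LISTEN_PORT))
    server.listen(1)
    while True:
        conn, _ = server.accept()
        with conn, conn.makefile() as lines:
            for line in lines:
                path = line.strip()
                if path.startswith('/'):
                    threading.Thread(target=fetch_later,
                                     args=(path,), daemon=True).start()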
watching progress
Ganglia graphs the total number of objects as well as the number of new objects per 30s time slice. If either of these metrics stagnates, verify that the tcpdump and urllistener are still running (they run in a screen session, ben/listener, on fenari; you can connect as root).
old method (as of 2012-01-30)
success condition
The goal is only 99% coverage; it's ok to not get 100% of thumbnails. Any that we miss will be picked up from ms5 or regenerated when they're requested.
Additionally, only publicly visible thumbnails are retrieved in this first iteration.
get a list of existing thumbnails from ms5
ionice -c 3 specifies the 'idle' priority.
cd /export/thumbs
for i in wikibooks wikimedia wikinews wikipedia wikiquote wikisource wikiversity wiktionary
do
  ionice -c 3 find $i -type f > /tmp/$i-filelist.txt
done
exclude commons for test clusters
Until we clear out the Google-generated stuff, it's just wasted space, so we skip anything on commons for testing. Note: for production we should include commons, i.e. skip this step.
for i in *-filelist.txt; do grep -v "/commons/" $i > ${i/.txt/-nocommons.txt}; done
transform the list into URLs
Turn the list of paths into a list of URLs that should be provided by swift. Loading each URL will cause the file to be fetched from ms5 and saved to swift.
swift_frontend="msfe-pmtpa-test.wikimedia.org:8080"
for i in *-nocommons.txt; do
  cat $i | sed -e "s/^/http:\/\/${swift_frontend}\//" > ${i/.txt/-urls.txt}
done
load all these urls
Fetch them all from swift; run this on hume or some other host in pmtpa.
cd ~ben/swift
for i in *-urls.txt; do ./geturls.py -t 30 $i; done
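geturls.py isn't included on this page either. The following is a hypothetical sketch of such a bulk fetcher, assuming -t is a per-request timeout in seconds and the argument is a file of URLs, one per line; the real script may differ (for example, it may fetch concurrently).

#!/usr/bin/env python3
# Hypothetical sketch of a bulk fetcher like geturls.py -- the real script
# may differ (e.g. it may fetch URLs concurrently).
# Each GET makes swift pull the missing thumbnail from ms5 and store it.
import argparse
import sys
import urllib.request

def main():
    parser = argparse.ArgumentParser(description='GET every URL in a file')
    parser.add_argument('-t', '--timeout', type=int, default=30,
                        help='per-request timeout in seconds (assumed meaning of -t)')
    parser.add_argument('urlfile', help='file with one URL per line')
    args = parser.parse_args()

    with open(args.urlfile) as fh:
        for line in fh:
            url = line.strip()
            if not url:
                continue
            try:
                urllib.request.urlopen(url, timeout=args.timeout).read()
            except Exception as exc:
                print('failed %s: %s' % (url, exc), file=sys.stderr)

if __name__ == '__main__':
    main()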