Obsolete:Ehcache
Ehcache was a disk-backed object cache that Wikimedia briefly experimented with in 2011. See mw:Disk-backed object cache for the rationale behind the experiment, and mw:Disk-backed object cache/status for the outcome.
This page served as the operations manual.
Maintenance
- Start/stop/restart
- /etc/init.d/tomcat6 {start|stop|restart}
- Log files
- /var/log/tomcat6/
- Process monitoring
- argv[0] is "java" so use ps -C java etc.
- Service monitoring
- curl -s 'http://db40:8080/ehcache/rest/mw/' | xmllint --format -
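Beyond listing the cache, a store/fetch round trip gives a more direct health check. The following is a minimal sketch, assuming the Ehcache Server REST interface accepts PUT, GET and DELETE on /{cache}/{element}. The function names, the EHCACHE_BASE variable and the test key are hypothetical and not part of the original setup.

```shell
#!/bin/sh
# Hypothetical helpers for a REST round-trip check (names are illustrative).
# Assumption: the REST base URL matches the one used for service monitoring.
EHCACHE_BASE="${EHCACHE_BASE:-http://db40:8080/ehcache/rest}"

# ehcache_url CACHE KEY: print the element URL, e.g. .../rest/mw/smoketest
ehcache_url() {
    printf '%s/%s/%s\n' "$EHCACHE_BASE" "$1" "$2"
}

# ehcache_smoke_test CACHE KEY: store a value, read it back, delete it.
# Returns non-zero if any request fails or the value read back differs.
ehcache_smoke_test() {
    smoke_url=$(ehcache_url "$1" "$2")
    smoke_value="smoke-$$"
    curl -sf -X PUT -H 'Content-Type: text/plain' \
        --data-binary "$smoke_value" "$smoke_url" >/dev/null || return 1
    smoke_got=$(curl -sf "$smoke_url") || return 1
    curl -sf -X DELETE "$smoke_url" >/dev/null
    [ "$smoke_got" = "$smoke_value" ]
}
```

Running `ehcache_smoke_test mw smoketest && echo OK` against the server would then report whether a full write and read cycle succeeds.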
Installation
- apt-get install tomcat6 libslf4j-java
- echo "ulimit -n 60000" >> /etc/default/tomcat6
- Get the latest version of ehcache-server from http://sourceforge.net/projects/ehcache/files/ehcache-server/
- Unpack the .tar.gz file into a temporary directory.
- The .war file is a JAR (i.e. zip) file. Unzip it into /var/lib/tomcat6/webapps/ehcache. Tomcat scans the webapps directory on startup and makes the subdirectory "ehcache" into the URL path for the servlet.
- unzip ehcache-server-1.0.0.war -d /var/lib/tomcat6/webapps/ehcache
- Replace the broken libslf4j JAR that came bundled in the .war with the one from the distro. This avoids linkage conflicts when Tomcat starts up.
- rm /var/lib/tomcat6/webapps/ehcache/WEB-INF/lib/slf4j-api-1.5.8.jar
- cp /usr/share/java/{slf4j-simple-1.5.10.jar,slf4j-api-1.5.10.jar} /var/lib/tomcat6/shared/
- cd /var/lib/tomcat6/webapps/ehcache/WEB-INF && mv server_security_config.xml_rename_to_activate server_security_config.xml
- Create a data directory for it.
- install -d -o tomcat6 -g tomcat6 /a/ehcache
- /etc/init.d/tomcat6 restart
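The manual steps above are easy to get partly wrong, so a quick check before restarting Tomcat can help. This is a sketch only: the function name is hypothetical, and it verifies just three things taken from the steps above (the bundled slf4j JAR is gone, the security config has been renamed, and the data directory exists).

```shell
#!/bin/sh
# check_ehcache_install WEBAPP_DIR DATA_DIR
# Hypothetical helper: returns 0 if the install steps appear complete.
check_ehcache_install() {
    chk_webapp=$1
    chk_data=$2
    chk_status=0
    # The bundled JAR should have been removed.
    if [ -e "$chk_webapp/WEB-INF/lib/slf4j-api-1.5.8.jar" ]; then
        echo "bundled slf4j-api-1.5.8.jar is still present" >&2
        chk_status=1
    fi
    # The security config should have been renamed to activate it.
    if [ ! -f "$chk_webapp/WEB-INF/server_security_config.xml" ]; then
        echo "server_security_config.xml has not been activated" >&2
        chk_status=1
    fi
    # The disk store directory should exist.
    if [ ! -d "$chk_data" ]; then
        echo "data directory $chk_data is missing" >&2
        chk_status=1
    fi
    return "$chk_status"
}
```

With the paths used on this page, the call would be `check_ehcache_install /var/lib/tomcat6/webapps/ehcache /a/ehcache`.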
Configuration
Most server configuration is done in /var/lib/tomcat6/webapps/ehcache/WEB-INF/classes/ehcache.xml. Here are some things I changed:
Change the diskStore configuration to:
<diskStore path="/a/ehcache"/>
Add this:
<cache name="mw" maxElementsInMemory="100000" maxElementsOnDisk="260000000" eternal="false" overflowToDisk="true" diskPersistent="true" memoryStoreEvictionPolicy="LFU" />
The average object size is around 4KB.
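As a rough sizing check, the element limits above can be multiplied by the average object size. The sketch below assumes "around 4KB" means 4096 bytes and ignores per-element overhead, so the results are only ballpark figures.

```shell
#!/bin/sh
# Back-of-envelope capacity estimate for the "mw" cache settings.
max_in_memory=100000      # maxElementsInMemory
max_on_disk=260000000     # maxElementsOnDisk
avg_bytes=4096            # assumption: "around 4KB" taken as 4096 bytes

# Integer arithmetic, so results are rounded down.
mem_mib=$(( max_in_memory * avg_bytes / 1024 / 1024 ))
disk_gib=$(( max_on_disk * avg_bytes / 1024 / 1024 / 1024 ))

echo "memory store: ~${mem_mib} MiB"
echo "disk store:   ~${disk_gib} GiB"
```

Under these assumptions the memory store holds roughly 390 MiB of objects and the disk store can grow to roughly 991 GiB, so the data directory needs on the order of a terabyte of space.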
Rationale
Our parser cache hit ratio is very low, around 30% (see the graph).
This seems to be mostly due to insufficient parser cache size. Tim's theory is that if we increased the parser cache size by a factor of 10-100, then most of the yellow area on that graph should go away. This would reduce our Apache CPU usage substantially.
The parser cache does not have particularly stringent latency requirements, since most requests only do a single parser cache fetch.
As for why Ehcache: it stood out because it has a suitable feature set out of the box and was easy to use from PHP. Tim created a MediaWiki client for it and committed it in r83208.
Plan
Here is the current plan:
- (done) Configure MediaWiki so that it replicates its writes to both memcached and Ehcache. This keeps both caches up to date and allows the Ehcache deployment to be reverted at any time, without a significant performance impact.
- (done) Deploy to testwiki. Observe request latencies.
- (done but reverted) Deploy to a wiki with low levels of real traffic, like en.wikiquote.org. Observe server resource utilisation.
- (done but reverted: 2011-03-28) Deploy to a wiki which will provide a substantial fraction of the full load, like en.wikipedia.org. Observe server resource utilisation and calculate whether it's feasible to deploy to all wikis.
- Deploy to all wikis. Allow the cache to fill up, say over 2 weeks, observing parser cache hit ratios and server resource utilisation.
- Revert the deployment and return db40 to the database pool. Puppetize and package Ehcache for use on new dedicated hardware.
- Deploy Ehcache to all wikis without write replication to memcached.
Status
(updated 2011-03-28) The initial test deployment was done with the default Ehcache configuration, which includes the Glassfish application server. Putting Ehcache under full load revealed a bug in how Glassfish handles running out of file descriptors: each worker thread sleeps for one second, waiting for the descriptor count to go down. This only compounds the problem, because it temporarily takes a worker thread out of commission.
A new test will be attempted using Ehcache + Tomcat, which hopefully won't exhibit this behavior.
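Since descriptor exhaustion was the failure mode, it may be useful to watch descriptor usage during the next test. The following is a sketch for Linux only, reading from /proc. The function names are hypothetical, and finding the server's PID (for example with ps -C java) is left to the operator.

```shell
#!/bin/sh
# fd_count PID: print the number of open file descriptors for a process.
# Linux only; reading another user's process usually requires root.
fd_count() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# fd_soft_limit PID: print the soft "Max open files" limit for a process.
fd_soft_limit() {
    awk '/^Max open files/ { print $4 }' "/proc/$1/limits"
}
```

Comparing `fd_count PID` against `fd_soft_limit PID` at intervals shows how close the server is to the limit set by the ulimit -n line in /etc/default/tomcat6.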