Undergoing (Too Much) Site Maintenance

January 25, 2008 Ian Hamilton Technology

Many users have seen the "undergoing routine site maintenance" page splashed across their browsers a bit too often recently when browsing to the PLoS ONE, PLoS Neglected Tropical Diseases or PLoS Hub Clinical Trials journals. We've taken some steps to resolve the problems and the TOPAZ developers are digging deep into the code. We won't have a full fix in place until TOPAZ RC 0.8.2.1 is released at the end of February, but band-aids are in place that should make the site outages much less frequent. If you're interested in the gory details, read on.

Russ and I were horrified witnesses to Java processes (that use ehcache) eating away at the 4 GB of memory on the publishing application (pubApp) servers. Eventually, all the memory would fill up and the TOPAZ application would turn into a zombie. To try to remedy the problem, we set up a cron job to automatically restart the pubApp every eight hours. But this caused a secondary problem: if Mulgara was in the middle of a transaction while the pubApp was restarting, the application would hang.
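One way to avoid that kind of hang is to have the scheduled job wait for a quiet moment before restarting. The following is only a minimal sketch of that idea, not what we actually run: `restart_when_idle`, `transaction_active` and `restart` are hypothetical names, and how you would really ask Mulgara whether a transaction is in flight is left as an assumption behind the `transaction_active` callback.

```python
import time
from typing import Callable


def restart_when_idle(
    transaction_active: Callable[[], bool],  # hypothetical probe: True while a transaction is in flight
    restart: Callable[[], None],             # hypothetical hook that actually restarts the application
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Restart as soon as no transaction is in flight.

    Returns True if the restart was performed, or False if the
    timeout expired while a transaction was still running.
    """
    deadline = clock() + timeout_s
    while True:
        if not transaction_active():
            restart()
            return True
        if clock() >= deadline:
            # Give up rather than restart mid-transaction; the caller
            # can retry on the next scheduled run.
            return False
        sleep(poll_s)
```

A cron entry would call a small script wrapping this function instead of restarting unconditionally. Skipping a restart when the timeout expires trades a little extra memory growth for not hanging the application.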

We have alerts set up for any time the pubApp hangs, so now Russ and I were waking up in the middle of the night to the unpleasant sounds of SMS alerts inundating our cell phones. And after a few nights of that, we were ready to:

1. Throw the cell phones out the window at 4 AM.

2. Purchase more memory for the servers.

Yesterday, Josh and I installed 4 GB more RAM (for a total of 8 GB) in the servers. We're still seeing the Java process eat up a bunch of memory, but it's not able to topple the 8 GB of RAM. We see ehcache having some problems when it overflows from memory to disk. And we still have automatic restarts a few times a day. But the sites should be more stable and we'll hopefully have this beast tamed by the next release.