PLOS BLOGS The Official PLOS Blog

Performance Issues of PLoS Websites

We’re continuing to experience slowness and intermittent downtime for the websites hosted on Topaz. At this time, the priorities for the IT and Topaz teams are to improve the stability of the Topaz applications and increase the performance of the websites.

The IT and Topaz teams are digging through many system files, configuration settings and code to improve the situation, but it may take one or two weeks before we overcome the performance issues. Please bear with us during this time as we make additional tweaks/upgrades to improve performance.

Hopefully we’ll address these issues quickly so that our users can enjoy the journal websites.

Discussion
  1. How about offering direct links from TOC to frequently downloaded links?

    This would potentially reduce load on your servers and be a convenience for users.

    For my purposes these links would be those to pdf and to the citation but an analysis of your download patterns would tell you if this is a high-traffic route or not.

    Of course it may be the case that most people just read articles in html in which case your current links are sufficient.

    Bill

  2. We’ve found that simple search is impacting the performance of the journal sites. We have disabled both simple and advanced search and are redirecting all search requests to our favorite 503 “site maintenance” page. We’re going to dig into the code and see why search queries are taking so long to return data.

    We’ve gone to DefCon 5 at PLoS to resolve the performance issues. It’s time to lock everybody into a conference room, order a bunch of pizza and fix every performance problem on the websites.

  3. We’ve had a good day of investigation and will apply patches to the production servers tomorrow. We now have a third pubApp instance behind our firewall that will share its cache with the other pubApps. This will help performance significantly as the pubApp cache will no longer need to be re-filled on restart (usually takes a few hours).

    We’ve uncovered other issues today and will continue digging into the belly of the beast. I expect to have more patches to increase stability/performance in place shortly. Search will be down for about a week while we sort through higher priorities.

  4. The sites have been more stable with the additional server seeding the pubApp caches. We were able to ingest articles today without taking the sites down and have had only 15 minutes of restart/downtime today (don’t mention that there are still 6.5 hours left in the day).

    Russ applied a few patches but rather than solving problems, they uncovered more mysteries. But we now have a way to clear a single article cache without clearing all caches. Previously we would have had to clear the cache for all articles to re-ingest an article. This will significantly help performance, as we only need to clear a small number of caches rather than hundreds (which would take hours to re-fill).

    More tomorrow….
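
The difference between clearing one article's cache and clearing every cache can be sketched as follows. This is a minimal illustration in Python, not Topaz code: the `ArticleCaches` class, its method names and the article ids are all hypothetical.

```python
class ArticleCaches:
    """Hypothetical per-article cache store; all names are illustrative."""

    def __init__(self):
        # article id -> {cache key: cached value}
        self._by_article = {}

    def put(self, article_id, key, value):
        self._by_article.setdefault(article_id, {})[key] = value

    def get(self, article_id, key):
        return self._by_article.get(article_id, {}).get(key)

    def clear_article(self, article_id):
        """Drop the entries of one article; return how many were removed."""
        return len(self._by_article.pop(article_id, {}))

    def clear_all(self):
        """Drop everything; every article must be re-filled afterwards."""
        removed = sum(len(entries) for entries in self._by_article.values())
        self._by_article.clear()
        return removed


caches = ArticleCaches()
caches.put("article-1", "html", "<p>one</p>")
caches.put("article-2", "html", "<p>two</p>")

# Re-ingesting article-1 only invalidates article-1.
caches.clear_article("article-1")
```

After the targeted clear, `article-2` is still served from cache, so only the re-ingested article pays the cost of a re-fill.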

  5. We had a “cascade event” at ~5:45am which backed up Mulgara. This event happens when a query pulls so much data from Mulgara that all other queries are left waiting for it to finish. This happens infrequently and, depending on site traffic, Mulgara is usually able to gracefully handle the backlog of requests. But this morning, Mulgara was unable to clear out the queue. This query is hard to track down (the saying “needle in a haystack” is appropriate) and we’ve upped the logging levels to trace all queries sent to Mulgara.

    Unfortunately, we also had a human error this morning. When I restarted the applications after the cascade event, a new article XSL file caused articles to throw errors trying to parse the article XML. Articles were serving “page not found” which resulted in more cascade events until I could revert the XSL file. We tested the XSL file on dev/stage without an error so something else must have snuck into the release.

    Now that the XSL file is fixed, the sites are purring. We’ve also identified another cache fix that will significantly help performance and a patch should be up today.
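
Tracing every query to find the one that hangs amounts to timing each call and flagging the outliers. The sketch below shows the idea in Python under stated assumptions: `QueryTracer`, the one-second threshold and the placeholder query strings are hypothetical, and nothing here reflects Mulgara's actual logging facilities.

```python
import time


class QueryTracer:
    """Records how long each query takes so slow ones can be listed later."""

    def __init__(self, slow_after=1.0, clock=time.monotonic):
        self.slow_after = slow_after  # seconds; threshold is an assumption
        self.clock = clock            # injectable so the tracer is testable
        self.records = []             # list of (query text, elapsed seconds)

    def run(self, text, execute):
        """Run execute(text) and record its duration, even if it raises."""
        start = self.clock()
        try:
            return execute(text)
        finally:
            self.records.append((text, self.clock() - start))

    def slow_queries(self):
        return [(q, t) for q, t in self.records if t >= self.slow_after]


# Deterministic demo: a fake clock stands in for real query latency.
ticks = iter([0.0, 0.2, 10.0, 15.0])
tracer = QueryTracer(slow_after=1.0, clock=lambda: next(ticks))
tracer.run("placeholder fast query", lambda q: "few rows")
tracer.run("placeholder huge query", lambda q: "many rows")
```

Here `tracer.slow_queries()` lists only the second query, which is the kind of shortlist that makes a needle-in-a-haystack search tractable.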

  6. Since this morning’s events, the sites have been performing great. Russ applied patches to resolve this morning’s XSL issue and to disable an article cache flush. We realized that all cache keys for images were being removed if a user added a note or rating to an article. These actions shouldn’t affect the static article images, so the flush of this cache was disabled. There are now over 10k images in this cache (affectionately known as SPCF), which will improve performance.

    After enabling detailed logging of queries to Mulgara, we haven’t had a query hang retrieving data (go figure). I’ve found a few queries that are taking a long time to retrieve data and will continue digging through the logs. We ingested articles throughout the afternoon without any disturbance to the Topaz force. We’ll (hopefully) cut a new release next week with more updates to the cache, and we will re-enable search.

    This has been a painful week for PLoS, authors, the communities of PLoS Computational Biology, PLoS Genetics and PLoS Pathogens, and users of the sites. We’ve slept little but have learned a LOT in the past week. We’re closing in on the last bit of issues and should have a much more stable platform soon.

  7. We’re making more progress on a couple of fronts. Russ has been able to demonstrate that memory is being gobbled up on the servers. We see memory filling up on the pubApp servers over the course of 6-8 hours. Once the memory fills up, we see high CPU usage, probably due to ehcache having to invalidate items from the cache on every operation. Russ is setting up a profiler to examine memory use on our test servers to see if he can reproduce the problem under load. We’re still not exactly sure where the memory leak occurs but have a couple of ideas:

    • ehcache
    • Apache Axis memory mapper / SOAP server
    • Dead mod_jk connections

    We’ve also tracked down a number of actions that are querying Mulgara directly rather than using the cache, including some queries that can return a huge amount of data. The developers are rewriting the code to use the cache or to be smarter about the data they’re pulling.
    We’re not going to make any more patches to the production server until we diagnose and fix the memory problem. Search will be reinstated on the site after we build and test a new release candidate in the next 1-2 weeks.
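
Two points from this update, reading through a cache instead of querying the store directly and a full cache evicting on every operation, can be illustrated with a small bounded read-through cache. This is a generic least-recently-used sketch in Python; `BoundedCache` is hypothetical and makes no claim about how ehcache is implemented.

```python
from collections import OrderedDict


class BoundedCache:
    """Read-through LRU cache that counts how often it has to evict."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.evictions = 0
        self._items = OrderedDict()

    def get(self, key, load):
        """Return the cached value, calling load(key) only on a miss."""
        if key in self._items:
            self._items.move_to_end(key)   # mark as most recently used
            return self._items[key]
        value = load(key)                  # the expensive backend query
        self._items[key] = value
        if len(self._items) > self.max_entries:
            self._items.popitem(last=False)  # drop least recently used
            self.evictions += 1
        return value


cache = BoundedCache(max_entries=3)
for key in range(10):
    cache.get(key, load=lambda k: k * 2)
```

With room for three entries and ten distinct keys, the cache evicts on seven of the ten calls: once it is full, every miss costs an eviction as well as a load, which is the kind of per-operation overhead suspected above.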

  8. Overall performance has been good since last week. We have to restart the stack three times a day due to the memory problem but these outages last less than 10 minutes. Since we have almost all caches shared with the new server, the sites come up very fast. Ingest of articles still slows down the sites but we have some ideas on how to make the ingest much faster.

    I’ve set up stress tests today with WebLOAD. It’s a pretty useful tool with great reporting but it’s not really an Open Source application. You can use WebLOAD with up to 100 virtual clients but have to buy a license for anything beyond that. But it’s enough to stress test our multi-server stack if I set the cache to a minimum. We’ve set up jProfiler to help diagnose the memory problems and hope to have some ideas soon about where the memory leak is occurring.

    We’re waiting for the Topaz developers to finish up code to serialize one last cache, move some data into the caches and accelerate a couple of OTM queries. We should have a fix for search soon and may be able to put basic/advanced search back up on the journal websites if testing goes well.
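
The idea of a stress test with a fixed number of virtual clients can be sketched with a thread pool. This is not how WebLOAD works internally; `run_load` and `fake_request` are hypothetical names, and the stub request simply sleeps instead of calling a real site.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(request, clients, requests_per_client):
    """Run `request` from `clients` concurrent workers and summarize timings."""

    def client(_client_number):
        timings = []
        for _ in range(requests_per_client):
            start = time.perf_counter()
            request()
            timings.append(time.perf_counter() - start)
        return timings

    with ThreadPoolExecutor(max_workers=clients) as pool:
        per_client = list(pool.map(client, range(clients)))

    timings = sorted(t for batch in per_client for t in batch)
    return {
        "requests": len(timings),
        "median": timings[len(timings) // 2],
        "slowest": timings[-1],
    }


def fake_request():
    # Stand-in for an HTTP call to a test server (assumption).
    time.sleep(0.001)


report = run_load(fake_request, clients=5, requests_per_client=4)
```

Replacing `fake_request` with a function that fetches a page from a test server would turn this into a very small load generator; the summary keys are just one possible report shape.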

  9. Hi, it looks from your headers that you are using Drupal with Apache on CentOS. I can’t tell what your bottleneck is, but I believe you can improve the performance by 100-200% by putting a caching proxy between the web and the heavyweight Apache+PHP. Try SQUID or even better, Varnishd.

    Unfortunately I have not worked with Drupal, but I worked with MediaWiki. What makes MediaWiki fast is that they use SQUID as a caching proxy. Whenever a wiki page outputs “Cache-control: max-age=3600” it gets cached by SQUID and the next request is served from RAM. On my commodity hardware this means 800 hits/sec instead of 17. By using Varnishd instead of SQUID I was able to reach 1700. I am not sure if Drupal outputs correct Cache-control headers, you may want to hack it.

    You can also gain a bit by putting the static files on a different host, say static.plos.org, and serving them with NGINX or LIGHTTPD rather than Apache.

    Beyond that, you can always get a lot by clustering, putting the site on different machines.

    I love Open Access and I’ll be glad to help make this faster. Really, this is not an acceptable download speed…

    –Anton
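
The caching-proxy behavior described in this comment, where a page that declares `max-age` is served from memory until it expires, can be sketched in a few lines. This is a deliberately simplified Python illustration, not Squid or Varnish: `TinyProxyCache` and `slow_backend` are hypothetical, and directives other than `max-age` are ignored.

```python
import re
import time


class TinyProxyCache:
    """Serves repeat requests from memory while max-age has not elapsed."""

    def __init__(self, backend, clock=time.monotonic):
        self.backend = backend      # callable: url -> (headers dict, body)
        self.clock = clock
        self.backend_hits = 0
        self._store = {}            # url -> (expires_at, body)

    def fetch(self, url):
        entry = self._store.get(url)
        if entry and entry[0] > self.clock():
            return entry[1]         # still fresh: no backend work
        self.backend_hits += 1
        headers, body = self.backend(url)
        # Simplification: only max-age is honored here.
        match = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""))
        if match and int(match.group(1)) > 0:
            self._store[url] = (self.clock() + int(match.group(1)), body)
        return body


def slow_backend(url):
    # Stand-in for the heavyweight application server (assumption).
    return {"Cache-Control": "max-age=3600"}, "<html>article</html>"


now = [0.0]  # fake clock so expiry can be shown without waiting
proxy = TinyProxyCache(slow_backend, clock=lambda: now[0])
proxy.fetch("/article/1")
proxy.fetch("/article/1")  # second request is served from memory
```

Only the first request reaches the backend; the second is answered from the in-memory store, which is where the large hits-per-second gains quoted above come from.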

  10. I am sorry – I think I put the previous comment in the wrong place. The comment interface is a bit confusing, or maybe I am just confused.

    I also missed the clustering discussion – looks like you are already doing a lot of that.

    But still, I do recommend Varnish and putting some pressure off Apache. It will save quite a few CPU cycles.

    Also, I just discovered that plos.org does not compress content! This must be the reason the site appeared ‘slow’ to me. Look:


    [toyvo@asusie ~]$ HEAD -H "Accept-encoding: gzip, deflate" http://www.plos.org/cms/node/334
    200 OK
    Connection: close
    Date: Tue, 18 Mar 2008 00:15:21 GMT
    Server: Apache/2.0.52 (CentOS)
    Content-Type: text/html; charset=utf-8
    Client-Date: Tue, 18 Mar 2008 00:04:59 GMT
    Client-Peer: 209.237.233.125:80
    Client-Response-Num: 1
    Set-Cookie: PHPSESSID=************************; expires=Thu, 10-Apr-2008 03:48:41 GMT; path=/
    X-Powered-By: PHP/4.3.9

    Please do enable that. For large pages, it reduces the load time considerably. It is not that difficult to install, just use mod_deflate.

    Thank you for the good work with PLoS.

    Kind regards,

    –Anton
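
The check made in this comment comes down to looking for a `Content-Encoding` header in the response. The Python sketch below does that on a header dictionary and also estimates how much gzip would save on a repetitive HTML body; `is_compressed`, `gzip_ratio` and the sample body are hypothetical, while the two header values are copied from the console output above.

```python
import gzip


def is_compressed(headers):
    """True if the response headers declare a gzip or deflate content coding."""
    for name, value in headers.items():
        if name.lower() == "content-encoding":
            codings = [c.strip().lower() for c in value.split(",")]
            return any(c in ("gzip", "x-gzip", "deflate") for c in codings)
    return False


def gzip_ratio(body):
    """Compressed size divided by original size (smaller is better)."""
    return len(gzip.compress(body)) / len(body)


# Two of the headers shown in the console output above.
plos_headers = {
    "Server": "Apache/2.0.52 (CentOS)",
    "Content-Type": "text/html; charset=utf-8",
}

# Made-up, highly repetitive markup; real pages compress less dramatically.
sample = b"<p>Open Access article text</p>\n" * 200
```

`is_compressed(plos_headers)` is false because no `Content-Encoding` header is present, which matches the conclusion drawn from the console output.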

  11. Hi Anton – thanks for the feedback. Unfortunately, the journal websites with the performance issues run on the TOPAZ platform. Only http://www.plos.org runs on Drupal. After some patches in the last two weeks, the TOPAZ websites serve almost all content from ehcache which gzips all objects in its cache.

    We’ll have to take a look at Varnish. We currently use a Squid proxy with our Apache servers.

  12. Thank you, Richard, for replying. Oh, I see now, sorry about being confused about plos.org and the TOPAZ sites.

    But just to make sure you do not misunderstand me, I was talking about two separate issues. Looks like #2 is still a problem.

    1. HTTP Proxy/Cache: you use SQUID, which is good. Varnish may be marginally better (up to 1.5 times faster in my experience), but it looks like you have more pressing issues at hand, so this is no issue then.

    2. GZIP compression of HTML, CSS and the like as it gets sent to the browser. I just checked plosone.org – that’s a TOPAZ site, right? And it also does not support this. Another proof is below.

    Most modern browsers will do HTTP requests with ‘Accept-encoding: gzip,deflate’ and expect to be given gzip-compressed HTML in return. It saves bandwidth (at the cost of CPU time), and it typically makes the site appear faster, because pages load faster.

    On-demand compression of HTML is completely different from how you cache application objects in ehcache.

    If plosone.org runs Apache, as it seems from the headers, you should give mod_deflate a try. This is a standard Apache 2+ module that does on-demand HTML compression transparently to the application that generates the HTML. You can configure it in 15 minutes and see if it helps.

    From my console:


    [toyvo@asusie ~]$ HEAD -H 'Accept-encoding: gzip,deflate' http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0001092
    200 OK
    Cache-Control: no-cache
    Connection: close
    Date: Thu, 20 Mar 2008 00:09:07 GMT
    Pragma: no-cache
    Server: Apache/2.0.52 (CentOS)
    Content-Type: text/html;charset=UTF-8
    Expires: -1
    Client-Date: Wed, 19 Mar 2008 23:58:30 GMT
    Client-Peer: 209.237.233.125:80
    Client-Response-Num: 1
    Set-Cookie: JSESSIONID=**********************; Path=/

    If the content were really compressed, the headers would have included a Content-Encoding: gzip or similar header.

    Finally, coming back to #1 (caching), Cache-Control: no-cache seems too restrictive for a static page, which this one appears to be. Relaxing this a bit can give SQUID some leverage and ease the load on the backend (TOPAZ). It can lead to unexpected behavior though if the page has dynamic elements that get updated while the cache does not know about it. This too can be worked around by PURGE requests… This is how Mediawiki does it – and those guys surely have some traffic 🙂
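
How a shared cache might read the `Cache-Control` header discussed here can be sketched as a small parser. This is a simplified Python illustration, not Squid's logic: `shared_cache_lifetime` is a hypothetical helper, and it treats `no-cache`, `no-store` and `private` alike as "do not reuse without going back to the origin", whereas real caches distinguish them and may also apply heuristic freshness.

```python
def shared_cache_lifetime(cache_control):
    """Seconds a shared cache may reuse a response without revalidating.

    Simplified: returns 0 whenever the header forbids reuse or gives no
    explicit lifetime.
    """
    directives = {}
    for part in cache_control.split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = value.strip('"')

    if {"no-store", "no-cache", "private"} & directives.keys():
        return 0
    # s-maxage applies to shared caches and takes precedence over max-age.
    for name in ("s-maxage", "max-age"):
        if directives.get(name, "").isdigit():
            return int(directives[name])
    return 0
```

For example, `shared_cache_lifetime("no-cache")` returns `0`, so every request goes back to the backend, while `shared_cache_lifetime("max-age=3600")` returns `3600`, which would let the proxy absorb repeat requests for an hour.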

