I have been wanting to blog about an experience I had not too long ago on a project where the jvm was consistently throwing OOM errors. It had me banging my head against my desk for a few days attempting to trace where the culprit was.
Was it code related?... yes, was it bad code?... sort of, was it very very very intense code (looping over many sql calls and instantiating many objects)?... YES! The challenge here was to apply a quick band aid rather than redesign this rules engine that had the characteristics listed due to time constraints. It is important to note I was not responsible for the poorly written code 8-).
I had to first identify if there were memory leaks due to this code. To do this I utilized YourKit Java Profiler. They have .NET and Java profilers that allow you to monitor their respective runtimes. YourKit Profiler is very simple to configure within the jvm.config file (I'm not going to get into that in this blog). My point here is that I witnessed memory steadily climb and at times spike, but was only able to reclaim memory when executing a manual GC. UGGG!!! So, no memory leak, but the runtime was hanging on to what it had... Why was the GC not reclaiming memory quickly on it's own? I had all the BP jvm args of old, etc., etc. ParNewGC and RMI to no avail.
And to get back to the initial issue; the blasted OOM error. I searched high and low and identified a thread on a Sun forum that there were issues with jre 1.5 that ultimately threw a OOM error if the runtime was unable to reclaim memory during a GC within a given time frame. The workaround for this was to set a time constraint in the JVM (which didn't work) or install 1.6_10 or later. This was my first step. I installed this JRE version and pointed my jvm.config to it. The application ran fine under this JRE except that I was still seeing the memory creep to the ceiling with no reclaim.
I then read on one of Sun's GC tuning white papers the following paragraph:
The -XX:+AggressiveHeap option inspects the machine resources (size of memory and number of processors) and attempts to set various parameters to be optimal for long-running, memory allocation-intensive jobs. It was originally intended for machines with large amounts of memory and a large number of CPUs, but in the J2SE platform, version 1.4.1 and later it has shown itself to be useful even on four processor machines. With this option the throughput collector (-XX:+UseParallelGC) is used along with adaptive sizing (-XX:+UseAdaptiveSizePolicy). The physical memory on the machines must be at least 256MB before AggressiveHeap can be used. The size of the initial heap is calculated based on the size of the physical memory and attempts to make maximal use of the physical memory for the heap (i.e., the algorithms attempt to use heaps nearly as large as the total physical memory).
Note: -XX:+UseAdaptiveSizePolicy is on by default so I don't explicitly define it in my args.
Amazingly, once I added this to the args, removed ParNewGC (enabled UseParallelGC) the server ran flawlessly for days and days without a restart. I was serving requests into the millions without a restart!!! A partial arg list specific to these setting are below, please let me know your thoughts and concerns as I always enjoy constructive feedback.
java.args=-server -Xmx1024m -Xms1024m -XX:+AggressiveHeap -XX:+UseParallelGC -Dsun.io.useCanonCaches=false -XX:MaxPermSize=512m
Note: This was for a ColdFusion 8 instance.
Sorry to hear. Could you give me some details on your server config.? I do not recommend implementing the settings I have listed if you are not running CF 8 with JRE 1.6_10 or greater.
Any chance you can send me the files? Hit me up on the contact us form on strikefish.com if you like.