A Very Lively Memory Leak

It’s summer, and we all know what that means: rain!  And when it rains, things get all leaky.  Memory leaky.

The past couple of days, we encountered a particularly interesting memory leak.  We had some code that would leak memory, and after around a month of running, the application would start having problems.

Due to the long reaction time, it was a bit difficult to get a handle on initially.  We started out with the best practice from the literature: make a snapshot using jmap, let the application run for some time (from a couple of days up to a week), make another snapshot, and compare the two using the baseline memory histogram of jhat.

A baseline histogram is supposed to show you which objects have been newly created between the two snapshots, allowing you to find which objects are the most likely culprits.  We did so, but much to our surprise, we didn’t see any object accumulate.  We saw a lot – an increasing number even – of objects, but they were all new.  None were left over from the baseline.

With all objects new, it couldn’t be that we were accumulating objects.  I mean, if we don’t leave any objects behind from earlier, how can we be leaking memory?  This puzzled us for days.  Yesterday – on the train like some poor pleb, even – it hit me like the meteor hit the dinosaurs: I saw model objects in the memory histogram.  I should not see model objects… The figure illustrates a person who enjoys having things leak on him, because I don’t really have any figures for this post, and it needed a figure about now.

The application is a relatively simple web service.  Messages come in, get checked, and get processed.  After a message has been processed, it should be gone forever.  I should at most see a couple of model objects representing these messages, certainly not enough for them to show up near the top of the memory histogram, which sorts objects by the number of occurrences.

Two things hit me like the parents hit an adopted redhead: maybe it’s the entity manager keeping objects alive, or maybe something is iffy with the asynchronous processing we do.

The entity manager is used to persist objects to the database, and we had recently replaced the implementation, favoring Hibernate over OpenJPA.  The entity manager converts object graphs to relational data, and for that it keeps some objects in memory so it can be sure to refer to the same objects using the same identifiers.  We do store the messages in a database, so that could explain it.  Checking the code, though, that could not be it.  While we did use Hibernate for storing data, we did not use it for storing the model objects we seemed to be leaking.

The asynchronous processing, then?  The application sends messages to an external application by creating files in the file system.  The other application responds by creating other files.  If the other application doesn’t pick up a file or doesn’t respond to a message (file), this is noticed using a timeout.  Thus, whenever we have delivered a message, we asynchronously start a timer in another context to register the timeout before replying to the caller.

We do this using Apache Camel, and the offending code (spoiler, by the way) looks roughly like this:

[xml]
<route id="processMessage">
  <!-- ... -->
  <!-- A -->
  <inOnly uri="seda:timeout" />
</route>

<route id="handleTimeout">
  <from uri="seda:timeout" />
  <!-- B -->
  <delay>
    <constant>5000</constant>
  </delay>
  <!-- C -->
  <!-- ... -->
</route>
[/xml]

The code is relatively simple; the first route (ll. 1-5) constitutes the main flow. Things are being processed (illustrated by the ellipsis in l. 2), and before finishing the route, we send an asynchronous message to the timeout route (l. 4). We use the inOnly block to send a message without expecting a reply.

To avoid blocking the main thread, we use the SEDA component, an efficient in-process message queue: it hands work off to another thread by simply putting it on a work queue, emulating event-based communication while taking advantage of shared memory.

The timeout is handled in another thread, in the timeout route (ll. 7-15). This route receives a message indicating what to time out (l. 8), waits for a specified amount of time (ll. 10-12), and then handles the timeout appropriately (indicated by the ellipsis in l. 14).

All good, right? Wrong!

Running the application, we noticed that only some of our messages got a timeout. Weird. Sure, we would expect the timeouts to take some time, but not so long that we could watch the border between the messages that had timed out and those that hadn’t move along. We would send in 100 messages within a few seconds and expect to see none of them timed out, and then, 5 seconds later, all of them timed out. That’s not what happened.

We enabled logging at points A, B, and C. We kept the logging timestamped so we could follow what happened. We then saw that while all the A points would happen roughly simultaneously (as expected), the B and C points were drifting. Sure, a point C would happen roughly 5 seconds after its point B, but the timespan between a point A and its corresponding point B would drift. This matches our observation: we saw the frontier slowly moving ahead.
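The logging itself was nothing fancy; in Camel’s XML DSL it amounts to dropping a log step at each of the three points, roughly like this (the exact messages don’t matter, and the timestamps come from the logging framework):

[xml]
<!-- in processMessage, at point A -->
<log message="A: queueing timeout" />
<inOnly uri="seda:timeout" />

<!-- in handleTimeout, at points B and C -->
<from uri="seda:timeout" />
<log message="B: timeout received" />
<delay>
  <constant>5000</constant>
</delay>
<log message="C: timeout fired" />
[/xml]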

That’s when it hit me for the third time, like lightning that knows full well it never strikes the same place twice: we would start the timeout asynchronously in another thread, with stress on AN. Literally one. All the timeouts were handled by a single thread, and the timeout for message 2 could not start before the timeout for message 1 had run out.  The picture is only related in the most superficial manner possible.  Pretty, though.

If messages come in faster than the timeouts can be processed, we accumulate timeouts in the queue between the two routes. As mentioned, each of these carries a copy of the message it was started for, so the timeout can be related to the correct message.
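Worth noting: that queue is where everything piles up, and in our case nothing was bounding it. Depending on the Camel version, the SEDA component has URI options to cap it (the option names below are from the SEDA component documentation; we did not actually use them). A bounded queue would have turned this silent memory leak into visible back-pressure much earlier:

[xml]
<!-- hypothetical: cap the queue at 1000 pending timeouts and make the sender block when it is full -->
<inOnly uri="seda:timeout?size=1000&amp;blockWhenFull=true" />
<!-- the consumer side declares the same capacity -->
<from uri="seda:timeout?size=1000" />
[/xml]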

And that’s why we were not seeing messages accumulating. Technically, they were not. We would eventually process all messages, but at the same time we would accumulate more and more pending work. It’s like a bucket with a hole in the bottom: if we pour water in faster than it can flow out, we accumulate water in the bucket. This is also a great analogy for memory leaks, because the literal leak from the hole in the bottom doesn’t correspond to the memory leak at all but instead to the desired behavior. What corresponds to the memory leak is the bucketness of the situation and the lack of a large enough leak.

On our test systems, we were using a timeout of 5 seconds and mostly sent in messages manually, meaning we would never accumulate timeouts in noticeable numbers. Furthermore, timeouts are the exceptional case and only provoke action if a proper response isn’t received in time, so even if we accumulated a few timeouts, we would never notice as long as the messages were properly acknowledged within the now wider timeframe.  The picture is a first-page Google image search result for “timeout.”  Your guess is as good as mine.

On production systems, we would instead get roughly one message every 3 seconds. Each message would have a timeout of 60 seconds (really, we would start 3 timeouts of 60 seconds, for reasons that are entirely irrelevant here). That means each message would occupy the single timeout thread for 180 seconds. During that time, we would receive 60 new messages, meaning we would effectively accumulate 59 messages every 180 seconds (an in-flow of 60 messages against an out-flow of one). That’s roughly 28,000 pending timeouts a day, each carrying a copy of its message, which lines up nicely with the application only getting into trouble after about a month of running. That’s the leak.

We could fix this in two ways. The first would be to increase the number of threads: with 100 threads, we could process 100 messages in a 180-second period, an out-flow of 100 against an in-flow of 60. This solution has its disadvantages (for example, I know the customer wants to increase the message rate to one every second, for an in-flow of 180 every 180 seconds), and in general it means the system can still be flooded if someone really wants to. Also, it’s just inelegant.
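For completeness, in Camel that first option is just a URI option on the consuming side of the SEDA endpoint: concurrentConsumers, which defaults to 1 – hence our single timeout thread. Had we gone that way, it would look roughly like this:

[xml]
<!-- hypothetical: 100 consumer threads pulling from the timeout queue -->
<from uri="seda:timeout?concurrentConsumers=100" />
[/xml]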

More elegant was to just make the delay itself asynchronous. That way, the timeout thread is not occupied during the delay, but only during the actual processing of the timeout. This is simply done in Camel by replacing lines 10-12 with:

[xml]
<delay asyncDelayed="true">
  <constant>5000</constant>
</delay>
[/xml]
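Putting it together, the timeout route then ends up looking like this:

[xml]
<route id="handleTimeout">
  <from uri="seda:timeout" />
  <!-- B -->
  <delay asyncDelayed="true">
    <constant>5000</constant>
  </delay>
  <!-- C -->
  <!-- ... -->
</route>
[/xml]

With asyncDelayed set, the delayer hands the waiting off to a scheduled thread pool, so the SEDA consumer thread is free to pick up the next timeout immediately instead of sleeping on the current one.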

Luckily, we caught this on an acceptance test environment before it caused issues in production. The acceptance environment was deliberately configured to handle the same data load as the production system, but with less memory, precisely to catch this kind of situation there before it could become a problem in reality.
