How Much Memory is Needed to Store a Log?

I’ve been looking a bit at log representations in ProM. We are using the XES standard and the OpenXES implementation. This was prompted by me wondering why ProM was so slow at handling logs. Many of the things ProM did with logs, I could replicate in a short time (i.e., less than an hour) in a way that was thousands of times faster. These were mostly really simple things like filtering logs and what have you.

The reason is that ProM uses an external representation of logs, because logs can be huge and are not necessarily possible to represent internally. This is in theory very sound reasoning, but neglects the first rule of external programming: “Don’t – unless you have to.” I quickly created a plug-in for ProM that switches the representation from the external representation to the so-called naive internal one.

My initial results were encouraging; I noticed a simple translation going from 17 seconds to 10 seconds. The translation consists of a pre-processing stage taking 7-8 seconds followed by a phase for translating my internal data-structure to an XES log. This means that the phase including XES went from 9-10 seconds to 2-3 seconds, a speed-up of around a factor of 4. Again, naive wins over clever. These experiments were all performed on my recent MacBook Pro, which has a decently speedy SSD, so I anticipate the difference being much more visible on an old-fashioned, low-end hard disk.

I then decided to take a look at the memory consumption. Much to my chagrin, I noticed ProM using around 1.5 GiB for my original data and two representations of the log, one internal and one external. Ok, this must mean that the reason for using the external implementation is that the internal one consumes more memory than drunken alzheimer.

Firing up my profiler to investigate yielded interesting results. First, my source data, which contains more information about the log than the log itself, but in a highly unstructured way, takes up a total of around 90 MB of memory. I got embarrassed about this and quickly whipped up an implementation based on the Trove Maps using around 74 MB of memory.

I then started comparing the memory consumption of the internal and external log representations and found them to be more or less the same. During this, I did discover something, though: it seems that a lot of memory was used to display the log and not to actually store it. I thus made this table. It starts from a freshly started ProM with an empty workspace and shows the change in memory after loading my source data-structure, then after converting my data-structure to an XES log, and finally after deleting the log visualization. The last row shows how much the last state consumes compared to just after loading my data-structure, i.e., the amount of memory used just for the log without the visualization. + means the memory use goes up, – means it goes down, and all numbers are in MB.

	Internal	External
Key/Value Loaded	+136.6	+136.8
Log Generated	+586.5	+590.3
Visualization Deleted	-246.3	-252.9
Log Without Visualization	+340.1	+337.3

We see that my initial data-structure (in this configuration) takes up around 136 MB, the log data-structures, both internal and external, around 340 MB, and the visualization around 250 MB. My log is not large, so maybe we just do not benefit from the external data-structure. Sometime, I should probably check this on a larger log, but my log is not exactly small (50,000 traces, 120,000 total events), so I anticipate 90% of the logs ever used in ProM will be smaller.
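The numbers above come from a profiler, but the same before/after bookkeeping can be roughly approximated from inside the JVM. The following is a minimal sketch that assumes only the standard java.lang.management API; the class name HeapDelta and the byte-array payload are hypothetical stand-ins, not ProM code. Note that used heap can include garbage that has not yet been collected, so it is a coarser measure than counting strongly reachable objects.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.Locale;

class HeapDelta {
    private static final MemoryMXBean MEMORY = ManagementFactory.getMemoryMXBean();

    /** Used heap in MB after requesting a GC (a hint only; the JVM may ignore it). */
    static double usedHeapMb() {
        System.gc();
        return MEMORY.getHeapMemoryUsage().getUsed() / (1024.0 * 1024.0);
    }

    /** Signed difference between two snapshots, formatted like the table rows. */
    static String delta(double beforeMb, double afterMb) {
        return String.format(Locale.ROOT, "%+.1f", afterMb - beforeMb);
    }

    public static void main(String[] args) {
        double baseline = usedHeapMb();

        // Hypothetical stand-in for "load the data-structure": hold on to 64 MB.
        byte[][] payload = new byte[64][1024 * 1024];
        double loaded = usedHeapMb();
        System.out.println("Loaded " + payload.length + " blocks: "
                + delta(baseline, loaded) + " MB");

        // Hypothetical stand-in for "delete the visualization": drop the reference.
        payload = null;
        double released = usedHeapMb();
        System.out.println("Released: " + delta(loaded, released) + " MB");
    }
}
```

If the second delta stays close to zero even after the reference is dropped, something else is still holding on to the data, which is the kind of symptom a profiler can then pin down.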

Next, I remembered the classic trick I also employed earlier, namely pointer compression. Switching this on yields:

	Standard Settings		Compressed Pointers
	Internal	External	Internal	External
Key/Value Loaded	+136.6	+136.8	+97.4	+97.9
Log Generated	+586.5	+590.3	+414.8	+419.1
Visualization Deleted	-246.3	-252.9	-184.1	-0
Log Without Visualization	+340.1	+337.3	+230.6	+419.1 / +235.0

We see everything becomes smaller, but the overall picture is the same. The only exception is that the visualization does not seem to be removed from memory for the external representation with compressed pointers. This happens sporadically, so I have just used the value from the other column in these cases. It does indicate, though, that ProM is not very good at releasing memory when removing a visualization, and given that it takes up 180-250 MB, depending on pointer compression, for a moderately sized log, this can be problematic. I have run all tests measuring only strongly reachable objects, so caching and garbage collection have no influence.
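On HotSpot, pointer compression is controlled by the -XX:+UseCompressedOops flag, and recent 64-bit versions typically switch it on by default for heaps below roughly 32 GB. A small sketch for checking what a running JVM actually uses, assuming a HotSpot-based JVM that exposes com.sun.management.HotSpotDiagnosticMXBean (the class name CompressedOopsCheck is made up for illustration):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

class CompressedOopsCheck {
    /** Returns "true" or "false" on HotSpot, or "unknown" if the option is unavailable. */
    static String compressedOops() {
        try {
            HotSpotDiagnosticMXBean hotspot =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (hotspot == null) {
                return "unknown";
            }
            return hotspot.getVMOption("UseCompressedOops").getValue();
        } catch (IllegalArgumentException e) {
            // Thrown when the bean or the option does not exist on this JVM.
            return "unknown";
        }
    }

    public static void main(String[] args) {
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("UseCompressedOops = " + compressedOops());
        System.out.println("Max heap (MB)     = " + maxHeapMb);
    }
}
```

Running it once with -XX:+UseCompressedOops and once with -XX:-UseCompressedOops shows whether the setting is picked up before comparing memory numbers.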

Ignoring this, I went looking at the largest objects, and found that the serializer thread uses a lot of memory. While I like the idea of everything being automatically serialized and available, I hate the serialization thread! It locks the application when serializing large objects, crashes ProM if you accidentally include a pointer to an object in the GUI or a thread inside a serialized object, and makes it really difficult to code objects that are not just simple data-structures (e.g., anything using external code). Anyway, I got another reason to dislike the serializer thread: I discovered that it took up hundreds of MB of memory, so I got rid of it and redid my experiments:

	Serializer				No Serializer
	Standard Settings		Compressed Pointers		Compressed Pointers
	Internal	External	Internal	External	Internal	External
Key/Value Loaded	+136.6	+136.8	+97.4	+97.9	+75.7	+75.3
Log Generated	+586.5	+590.3	+414.8	+419.1	+281.3	+336.1
Visualization Deleted	-246.3	-252.9	-184.1	-0	-184.0	-0
Log Without Visualization	+340.1	+337.3	+230.6	+419.1 / +235.0	+97.1	+336.1 / +152

Things look much better, and we discover that now the internal representation performs significantly better than the external one. It uses just below 100 MB where the external one uses 150 MB.

Quite surprisingly, at least to me, I could not get any improvement by replacing the data-structures in the log representations with the more efficient Trove implementations. Investigating the memory use, I did discover that the fact that attributes can have attributes is really expensive; if we look at the objects taking up the most space after removing all objects but the log from memory, we get:

The char arrays stem from the strings containing the up to three attributes on each of the around 120,000 events in the log. The HashMap$Entry objects are from the hash tables containing the up to four attributes of each event, and are the reason I’m surprised that switching to a Trove map does not improve consumption. Next, the XAttributeMapLazyImpl is a hack to use less memory when an object contains no attributes. As this covers all ~400,000 attributes, all events, all traces, and all logs, this is a lot. By disallowing attributes from having attributes, we can remove this. We know why we have the strings, and the two XAttributes (Literal and Timestamp further below) are objects abstracting attributes. If attributes can no longer have attributes, we can just have each event, trace and log contain key/value mappings, freeing up this memory as well. I think it would also make it possible to get rid of most of the HashMap$Entry objects.
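To make the flat key/value idea concrete, here is a minimal sketch of what an event without nested attributes could look like. It is not OpenXES code: the FlatEvent class and its methods are hypothetical, and only the keys follow the standard XES extensions. Values are stored as plain objects, so there is no wrapper object and no attribute map per attribute.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

class FlatEvent {
    // Keys map straight to plain values (String, Instant, Long, ...).
    // No per-attribute wrapper object and no nested attribute map.
    private final Map<String, Object> attributes = new HashMap<>(4);

    FlatEvent put(String key, Object value) {
        attributes.put(key, value);
        return this;
    }

    /** Typed lookup; returns null if the key is absent. */
    <T> T get(String key, Class<T> type) {
        return type.cast(attributes.get(key));
    }

    int size() {
        return attributes.size();
    }

    public static void main(String[] args) {
        // Illustrative values only.
        FlatEvent event = new FlatEvent()
                .put("concept:name", "Register request")
                .put("time:timestamp", Instant.parse("2011-01-01T10:00:00Z"))
                .put("org:resource", "Pete");

        System.out.println(event.get("concept:name", String.class));
        System.out.println(event.get("time:timestamp", Instant.class));
    }
}
```

The sketch still uses a HashMap per event, so it would not by itself remove the HashMap$Entry overhead; a more compact layout (for example parallel key and value arrays) would be a further step.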

Thus, if we remove attributes from attributes and allow arbitrary objects as values, we would be able to store the log using approximately 37 MB less, for a total of only 60 MB. Extrapolating, we could thus store a log with roughly 3.2 million traces and 8 million events in 4 GB of memory. We could probably save even more by using a flyweight implementation of logs.
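One possible reading of "flyweight" here is to share identical attribute keys and values between events instead of storing a fresh copy per event. The sketch below shows that idea with a simple canonicalizing pool; the StringPool class is hypothetical and only illustrates the technique, under the assumption that many events repeat the same activity and resource names.

```java
import java.util.HashMap;
import java.util.Map;

class StringPool {
    private final Map<String, String> pool = new HashMap<>();

    /** Returns one shared instance for all equal strings, so duplicates can be collected. */
    String canonical(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int distinct() {
        return pool.size();
    }

    public static void main(String[] args) {
        StringPool pool = new StringPool();

        // new String(...) forces separate instances, as a parser reading a file would create.
        String a = pool.canonical(new String("Register request"));
        String b = pool.canonical(new String("Register request"));

        System.out.println(a == b);          // true: both events share one instance
        System.out.println(pool.distinct()); // 1
    }
}
```

How much this saves depends entirely on how repetitive the attribute values are; timestamps, for instance, would rarely be shared.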

In conclusion, I think we should stop using the external implementation by default and make it a manual option to switch on if needed. I have not, with my log, been able to find a single scenario where the external implementation uses less memory than the internal one, and the internal implementation is several orders of magnitude faster. Furthermore, the visualization of logs should not be stored if at all possible, as it uses way too much memory (around twice that of the actual log!). This makes the external representation completely moot, because the visualization will use that much memory even with the external representation. We could for example remove the list of all traces or make it dynamically computed. We should also ensure that visualizations are really removed. The default of ProM for machines with less than 32 GB of memory (64 GB for new versions of HotSpot) should be to use pointer compression, since most data-structures in ProM consist of lots of small objects.

Also, it would be interesting to look at an XES implementation using SQLite and generic SQL databases.