I’m running some experiments with a parallel implementation of state-space generation and observed something very interesting.
I was running the experiments on my own computer (iMac, Core i7 2.8 GHz w/ 4 cores (8 HT) and 8 GiB 1067 MHz RAM). The experiment has been running for almost 4 days now and has just about exhausted the 4 GiB RAM I allocated for it. I have limited it to 5 computation threads and one coordinator thread, as I also need to be able to use the machine for other things. As the exploration is not entirely done yet, I decided to also run the exact same program on our grid at work, using one node (dual Xeon L5520 2.23 GHz for a total of 8 cores (16 HT) and 12 GiB RAM). As I am only using that machine for experiments, I allowed it to use 10 GiB RAM and 20 computation threads (each computation thread has some overhead in addition to the 10 GiB, so it probably ends up exhausting all memory).
State-space generation is about explicitly (or, in some cases, symbolically, but not in this case) representing an implicitly given graph. Basically, the graph is given as an initial node and a function that, given a node, returns all its successors. State-space analysis is useful for exhaustive testing of complex systems, and that is exactly what I use it for. A simple single-threaded implementation of state-space generation is
[raw]  1 GenerateStateSpace(State initial)
 2   states := new HashTable()
 3   stack := new Stack()
 4   states.add(initial)
 5   stack.push(initial)
 6   while (!stack.isEmpty())
 7     State s := stack.pop()
 8     for (State ss : s.successors())
 9       if (states.add(ss))   // add returns true iff ss was not already in states
10        stack.push(ss)
[/raw] This algorithm implements a simple depth-first traversal of the state space, which is often a good way to ensure that problems are found fast (in my example, all problems were found within the first minute of a computation that has now been running for almost four days without terminating). We can make the traversal breadth-first by using a (FIFO) queue instead of a stack, and we can implement a custom traversal, e.g., an A* heuristic, by using a priority queue.
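In Java-like code, the whole traversal strategy lives in the choice of frontier data structure. Here is a minimal sequential sketch, with a hypothetical `State` interface standing in for the real successor function:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SequentialStateSpace {
    // Hypothetical State interface; the real one would expose the successor function.
    interface State { List<State> successors(); }

    // Depth-first generation: the frontier is a LIFO stack. Swapping in a FIFO
    // queue (addLast/pollFirst) gives breadth-first; a PriorityQueue gives a
    // best-first / A*-style traversal.
    static Set<State> generate(State initial) {
        Set<State> states = new HashSet<>();
        Deque<State> stack = new ArrayDeque<>();
        states.add(initial);
        stack.push(initial);
        while (!stack.isEmpty()) {
            State s = stack.pop();
            for (State ss : s.successors()) {
                if (states.add(ss)) {   // add() returns true iff ss was not seen before
                    stack.push(ss);
                }
            }
        }
        return states;
    }

    // Tiny illustrative state space: integers 0..bound with successor n+1.
    record IntState(int n, int bound) implements State {
        public List<State> successors() {
            return n < bound ? List.of(new IntState(n + 1, bound)) : List.of();
        }
    }
}
```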
We make this parallel by having multiple threads execute the while-loop in lines 6-10 and simply synchronizing access to states and stack. Most of the time is spent computing successors in line 8, so this is expected to scale reasonably well.
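A minimal sketch of the parallel version, under the same hypothetical `State` interface. It uses lock-free collections from java.util.concurrent in place of explicit synchronized blocks, and an atomic counter for termination, since an empty stack alone does not mean the traversal is done while other threads are still expanding states. The coordinator thread and memory limits from my setup are omitted:

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelStateSpace {
    interface State { List<State> successors(); }

    static Set<State> generate(State initial, int nThreads) {
        Set<State> states = ConcurrentHashMap.newKeySet();              // thread-safe 'states'
        ConcurrentLinkedDeque<State> stack = new ConcurrentLinkedDeque<>(); // thread-safe 'stack'
        // pending = states pushed but not yet fully expanded; it lets workers
        // distinguish "stack momentarily empty" from "traversal finished".
        AtomicInteger pending = new AtomicInteger(1);
        states.add(initial);
        stack.push(initial);

        Runnable worker = () -> {
            while (pending.get() > 0) {
                State s = stack.poll();                 // pop from the top (LIFO)
                if (s == null) { Thread.onSpinWait(); continue; }
                for (State ss : s.successors()) {       // the expensive part
                    if (states.add(ss)) {               // atomic: true iff ss is new
                        pending.incrementAndGet();
                        stack.push(ss);
                    }
                }
                pending.decrementAndGet();
            }
        };
        Thread[] ts = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) { ts[i] = new Thread(worker); ts[i].start(); }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return states;
    }

    // Tiny illustrative state space: integers 0..bound, successors n+1 and n+2.
    record IntState(int n, int bound) implements State {
        public List<State> successors() {
            return n >= bound ? List.of()
                 : List.of(new IntState(Math.min(n + 1, bound), bound),
                           new IntState(Math.min(n + 2, bound), bound));
        }
    }
}
```

The busy-wait in the worker is a simplification; a production version would block on a condition variable instead of spinning.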
To get back to the amusing thing I found in my example: I am running the algorithm above with 5 and 20 computation threads, respectively. Everything else is the same; the number of threads is in both cases the one achieving the best performance on each machine. For each experiment, I make a number of measurements to follow the computation, including the sizes of states and stack and the total number of successors calculated (which is often larger than the size of states, as a state may be a successor of multiple other states).
In this chart, I have plotted the size of states as a function of time (in minutes). I have zoomed in on the prefix of the chart, as the computation on the grid has only been running for 16 hours and would otherwise disappear next to the execution that has been running for 4 days. We see that both computations add states to states at a more or less linear rate (aside from a brief hiccup after around 30 hours in the 5-thread case). As expected, running on the computer with 2 CPUs is faster, but not by much (each CPU is slower).
The amount of work done is not really measured by the number of states added to states but by the number of successors computed. For this reason, here is a plot of the number of successors computed in the last minute as a function of the total computation time. It shows the same trend: both computations progress the same way, but the one running with 20 threads is a bit faster, computing around 14,000 successors a minute compared with around 9,000.
Now, of course, we'd expect the progression of the stack to also be similar for the two cases, right? Well, that is the surprising thing: the size of the stack is completely different. In the following charts, we have plotted the size of stack as a function of time. The first shows the full range, and the second zooms in on the period of time also covered on the grid. The third chart uses a logarithmic scale on the y-axis to blow up the small numbers, and the fourth restricts the values on the y-axis.
It is interesting to notice that for 5 threads, the stack grows steadily until around 1600 minutes in (with a hiccup at 800 minutes) and then changes character, still growing more or less steadily but at a lower rate. For 20 threads, on the other hand, the stack does not grow. Rather, it stays more or less constant at 1200-1500 states, fluctuating a bit up and down, with a steady but slow decrease.
As the charts for the stored states and computed successors show, the algorithm performs more or less the same regardless of the number of threads, so performance differences are not the explanation for the difference in the size of stack. I haven't investigated the differences in detail, but my suspicion is that they stem from differences in the traversal.
When describing the algorithm, I said that using a stack makes the traversal depth-first and using a queue makes it breadth-first. This is true as long as we have just one worker thread. With more worker threads, we in fact look at more than just the top element of the stack, namely one element for each thread, allowing us to process states closer to the bottom of the stack than would otherwise be possible. This makes the traversal closer to a combination of depth-first and breadth-first: we still traverse the state space depth-first, but we investigate successors not one at a time but 20 at a time (with 20 worker threads), giving a certain breadth as well. It is well-known that the traversal strategy has a huge impact on the size of the frontier data structure: some state spaces end up with almost all states on the stack during a depth-first traversal if they are very connected, and some end up with huge parts of the state space on the queue during a breadth-first traversal if they have many states at the same minimal distance from the initial state.
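One way to see the mixed traversal without running week-long experiments is a sequential simulation that pops k states per round, as a toy model of k worker threads. The state space here is a complete binary tree, chosen purely for illustration (real state spaces are graphs, and the direction and magnitude of the effect depend heavily on their shape), but even on a tree the batch size alone visibly changes how the frontier behaves:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BatchTraversal {
    record Node(long id, int level) {}

    // Toy model of k worker threads: each round pops up to k states from the
    // top of the stack, expands them all, and pushes the new successors.
    // k = 1 is plain depth-first. The state space is a complete binary tree
    // of the given depth, with nodes numbered heap-style (root = 1).
    // Returns {number of states seen, maximum stack size observed}.
    static long[] run(int depth, int k) {
        Deque<Node> stack = new ArrayDeque<>();
        Set<Long> seen = new HashSet<>();
        stack.push(new Node(1, 0));
        seen.add(1L);
        long maxFrontier = 1;
        while (!stack.isEmpty()) {
            List<Node> batch = new ArrayList<>();
            for (int i = 0; i < k && !stack.isEmpty(); i++) batch.add(stack.pop());
            for (Node s : batch) {
                if (s.level() < depth) {
                    for (long child : new long[]{2 * s.id(), 2 * s.id() + 1}) {
                        if (seen.add(child)) stack.push(new Node(child, s.level() + 1));
                    }
                }
            }
            maxFrontier = Math.max(maxFrontier, stack.size());
        }
        return new long[]{seen.size(), maxFrontier};
    }
}
```

On a tree, the batched traversal behaves more breadth-first near the root and keeps a much larger frontier than k = 1, even though both visit exactly the same states; on a highly connected graph the balance can tip the other way, which is exactly why the thread count can change the stack's character.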
Now, the next step is, of course, to investigate how the size of the stack progresses as a function of the number of worker threads (but I doubt I'll have the patience to run week-long experiments with 1-50 worker threads…).
Anyone else seen this behavior or have ideas as to what causes it?