The goal of our transformation on the IMDB data set was to find all pairs of persons (actors or actresses) which starred together in at least three movies. For every such pair we want to create an instance of the Couple class. We used the Henshin rule below to do this. First, note that all nodes / edges have *-actions and thus the rule is applied to all found matches at once. Second, we used <<require*>>-actions for the movies, and not <<preserve*>>. This is important because we want to ensure the existence of these movies, but be don't want to generate separate matches for every combination of movies. There is only one small subtlety in this rule: due to the symmetry of person nodes, we will actually create two couple nodes for every pair. But this could be easily fixed by enforcing an alphabetical order of the persons' names using an attribute condition.
To compare the performance of the Giraph-based engine, we also tried to execute this example with the normal Henshin interpreter. To our surprise, even for a small part of the IMDB data set, the Henshin interpreter was not able to compute a match. To study the problem in a more controlled setting, we defined a small transformation that generates a synthetic test data set of variable size. We then ran the interpreter again and got the following numbers.
Nodes | Couples | Time (ms) |
1,700 | 200 | 1,625 |
3,400 | 400 | 3,753 |
5,100 | 600 | 7,828 |
6,800 | 800 | 14,386 |
8,500 | 1,000 | 23,301 |
This looks pretty bad. The input graphs are very small (< 10,000 nodes) and it already takes more than 20 seconds. The main problem is that the execution time grows non-linearly (should be O(n2)) with the size of the input graph. What is the reason for this? Well, the Henshin interpreter first fixes a match for the LHS (<<preserve>> and <<delete>> elements), and after that it tries to match the application conditions (<<require>> and <<forbid>> elements). In our example, this means that it first picks to arbitrary persons, and then checks whether they starred together in 3 movies. Thus. it is basically a brute-force approach that the interpreter runs here.
The pattern matching is implemented in Henshin as a constraint-solving problem. To fix the bad performance of the movies example (and similar problems), we added a new type of constraint to the pattern matching algorithm: path constraints. Path constraints are internally used to check whether there exists a path between two EObjects along a given list of EReferences. The neat thing about these paths is that they can also go via <<require>> nodes. In our example, this indeed works because the movies-reference has an opposite reference: persons. If we now fix one of the two persons, the path constraint via the references [movies, persons] allows the match finder to restrict the solution set for the second person to only those persons who played together with the first one in at least one movie. This drastically prunes the search space which gives us a much better performance.
Nodes | Couples | Time (ms) |
1,700 | 200 | 133 |
3,400 | 400 | 89 |
5,100 | 600 | 35 |
6,800 | 800 | 36 |
8,500 | 1,000 | 49 |
... | ... | ... |
170,000 | 20,000 | 1,419 |
850,000 | 100,000 | 6,183 |
1,700,000 | 200,000 | 9,836 |
This looks much better: now we can process 1.7 million nodes in less than 10 seconds. This should be also enough to run the example on the original IMDB data set. The example including the benchmark is part of the examples plug-in of the development version of Henshin.
I wanted to reproduce the samples, so I got a Henshin nightly build and the examples plugin. When I run the example, I get similar results. Good job. :-)
ReplyDeleteNow I want to solve that example with another tool. Therefore, I need to serialize the models. But it takes ages using standard XMIResourceImpls. Ok, no problem, I've written my own simple XMI export which is several times faster, but still it sucks badly. The problem is that iterating over EGraph.getRoots() takes about sixteen seconds even for the smallest model with only 170.000 nodes.
So how can it be that this exact model can be transformed in about 2 seconds when just iterating all nodes once already takes 16 seconds?!? Obviously, I must be doing something wrong...
[As another data point: for the model with 1.190.000 nodes, iterating all the elements took 1172 seconds!]
I've just had a look at EGraphImpl.getRoots(), and the problem is that it does a contains-check on an ArrayList for any root, so it's O(n^2) in the worst case. And in the movie models every node is a root, so here we have the worst possible scenario.
ReplyDeleteOk, anyway, now I've seen that EGraphImpl extends LinkedHashSet, so I can work around that problem.
Great tip, thanks a lot! We should definitely fix this problem in getRoots().
ReplyDeleteHappy holidays!
Hi Christian. Today I found the time to do that transformation also with FunnyQT and to compare the two solutions in a blog post.
ReplyDeletehttp://tsdh.wordpress.com/2014/01/03/finding-movie-couples-henshin-vs-funnyqt/