The goal of our transformation on the IMDB data set was to find all pairs of persons (actors or actresses) which starred together in at least three movies. For every such pair we want to create an instance of the Couple class. We used the Henshin rule below to do this. First, note that all nodes / edges have *-actions and thus the rule is applied to all found matches at once. Second, we used <<require*>>-actions for the movies, and not <<preserve*>>. This is important because we want to ensure the existence of these movies, but be don't want to generate separate matches for every combination of movies. There is only one small subtlety in this rule: due to the symmetry of person nodes, we will actually create two couple nodes for every pair. But this could be easily fixed by enforcing an alphabetical order of the persons' names using an attribute condition.
To compare the performance of the Giraph-based engine, we also tried to execute this example with the normal Henshin interpreter. To our surprise, even for a small part of the IMDB data set, the Henshin interpreter was not able to compute a match. To study the problem in a more controlled setting, we defined a small transformation that generates a synthetic test data set of variable size. We then ran the interpreter again and got the following numbers.
Nodes | Couples | Time (ms) |
1,700 | 200 | 1,625 |
3,400 | 400 | 3,753 |
5,100 | 600 | 7,828 |
6,800 | 800 | 14,386 |
8,500 | 1,000 | 23,301 |
This looks pretty bad. The input graphs are very small (< 10,000 nodes) and it already takes more than 20 seconds. The main problem is that the execution time grows non-linearly (should be O(n2)) with the size of the input graph. What is the reason for this? Well, the Henshin interpreter first fixes a match for the LHS (<<preserve>> and <<delete>> elements), and after that it tries to match the application conditions (<<require>> and <<forbid>> elements). In our example, this means that it first picks to arbitrary persons, and then checks whether they starred together in 3 movies. Thus. it is basically a brute-force approach that the interpreter runs here.
The pattern matching is implemented in Henshin as a constraint-solving problem. To fix the bad performance of the movies example (and similar problems), we added a new type of constraint to the pattern matching algorithm: path constraints. Path constraints are internally used to check whether there exists a path between two EObjects along a given list of EReferences. The neat thing about these paths is that they can also go via <<require>> nodes. In our example, this indeed works because the movies-reference has an opposite reference: persons. If we now fix one of the two persons, the path constraint via the references [movies, persons] allows the match finder to restrict the solution set for the second person to only those persons who played together with the first one in at least one movie. This drastically prunes the search space which gives us a much better performance.
Nodes | Couples | Time (ms) |
1,700 | 200 | 133 |
3,400 | 400 | 89 |
5,100 | 600 | 35 |
6,800 | 800 | 36 |
8,500 | 1,000 | 49 |
... | ... | ... |
170,000 | 20,000 | 1,419 |
850,000 | 100,000 | 6,183 |
1,700,000 | 200,000 | 9,836 |
This looks much better: now we can process 1.7 million nodes in less than 10 seconds. This should be also enough to run the example on the original IMDB data set. The example including the benchmark is part of the examples plug-in of the development version of Henshin.