How to compress repeating branches in directed graph?

I work a lot with directed graphs resulting from heap dumps of Java programs. One thing that characterizes them is that they contain lots of repeating patterns. I would like to find a way of compressing such patterns whilst still retaining the essential structure of the graph. For instance consider the following "molecules":

  |         |
  A1        A5
 / \       / \
B2  C4     B6 C8
 \ /       \ /
  D3        D7

the letter represents the type of the object and the number represents its unique id (or dfnum). Clearly the second molecule is a repeat of the first just with different ids. From a heap analysis point of view, the actual ids are unimportant so you could replace the molecule at A5 with something that said effectively "another copy of A1". On decompression (for input to heap analysers for instance) you could just assign arbitrary unique ids.

I could spot such patterns by maintaining a hash of the type of the objects during the dfs of the graph. So the hash for A1 would contain (for instance) "A^B^C^D" and this would match that for A5.

The problem I have is with molecules that point to some external molecule. By external I mean something that has been visited earlier in the dfs. For instance (sorry about the ugly ascii graphics):

  |         |
  A1        A5
 / \       / \
B2  C4    B6  C7
 \   \   /   /
  \   \ /   /
   \  | |  /
    \ | | /
     \| |/
       D3

for this situation when I come to descend from A5 I find that D3 has already been visited. So I would like the hash code for A5 to represent the unique value for D3 rather than just the type. I.e. something like "A^B^C^D3". So on compression/decompression we are differentiating A5 as being a copy of A1 rather than some other A whose B and C point to some other D.

My question is whether there are any tricks to do such a calculation, i.e. how to tell that D is "outside" our molecule whose root is A? This also has to be done in an efficient manner, preferably with one or two dfs's. (Heapdumps contain tens of millions of objects). You don't know in advance that A is a candidate so it probably needs to be an algorithm that does it during the dfs. Maybe something dominator tree related?

Hope that makes sense!

Edit: updated diagrams to clarify the fact that A1 and A5 themselves have parents and are just arbitrary nodes discovered during a DFS walk of the entire graph.

Clarification: for my purposes it's not important that the match is 100% guaranteed. By using hash codes as above I am happy to take the very small chance that there will be a hash collision and the algorithm will mistakenly classify a molecule incorrectly. Because such collisons will be rare they are unlikely to much affect the overall picture.


ANSWERS:


Without thinking about your problem in too much depth, I have solved my problem of compressing web data for my research using a trie. You should be able to serialize your data onto a trie for the purposes of compression.


From what I can tell, this is likely to be related to the graph isomorphism problem. The wikipedia page has a few pointers to current approaches to this problem. However, from a brief skimming, most of these seem to be designed to consider two entire graphs, and not look for isomorphic subgraphs within a larger graph.

With respect to search algorithms, my gut feeling is that depth-first search is not the right approach for this problem. You might think for a bit about what a breadth-first traversal might do. At least in the specific example you describe, that would allow you to look at A1 and A5 first, before committing to a specific "molecule" shape at either one.


This is probably a too abstract answer to actually help, but you can create a subgraph per repeated pattern, then collapse each pattern occurrence to a single node (with a pointer to the respective subgraph structure). The node would manage all edges connected to the pattern. Such edges must also remember the node of the pattern they connect to, so you would be able to offer graph traversals which hide the details of the representation, but traverse the graph as if it was "as it meant to be". This would be complicated if you fail to abstract the internal representation and your algorithms need to understand nested graphs.

As a side note, while graph isomorphism is generally a tough problem, in your case you have so much metadata on your graph (e.g. object types, field names, etc), which is equivalent to having a labeled graph, with very rare, selective labels. Such labels greatly prune the required effort to find isomorphic patterns (up to a small size of course, otherwise your pattern cache would fill all memory).

Since there are going to be lots of objects that follow closely the class definitions (if there was no inheritance, objects would be structs and their runtime types would exactly match the definitions), I predict that what you're trying to do has lots of potential to significantly compress the object graph.



 MORE:


 ? Best way to test for total reachability
 ? Graph theory question, Java. Which algorithm to achieve the following
 ? Implementation Detail for Graph Analysis Algorithms
 ? Partitioning of a directed graph
 ? Longest circle in graphs
 ? Find cycle of shortest length in a directed graph with positive weights
 ? graph - How to find Minimum Directed Cycle (minimum total weight)?
 ? Find the minimum weight cycle that contains a specific edge
 ? an algorithm to choose the best edge to reinsert after deleting
 ? an algorithm to choose the best edge to reinsert after deleting