Theoretically, what data structure can I use for trees with shared memory?

Question

Real world problem

I have a forest of about 20,000 trees, and it occupies too much memory. The trees are similar, though: you can find groups of roughly 200 trees that share a common subtree of significant size (tens of percent of each tree).

Theory

So knowing that:

The trees are similar, i.e. they share a common connected subgraph that includes the root (not necessarily the leaves, but possibly).

Is there a data structure that stores this information efficiently? Once the structure is created, I'm only interested in reading.

It doesn't have to be a solution tied to .NET; I could code it from scratch, I just need the idea :D But of course, if there is some little-known structure in .NET that achieves this, I would be pleased to know.

I have a feeling that this shared memory stuff may have something to do with immutable structures that by definition are expected to share memory...

My trees are not binary search trees, unfortunately. A node can have any number of children.

Reading

As for reading, it is quite simple: I always navigate from the root to a leaf, as you would in any JSON or XML document given an exact path to a value.

Nature of similarity

The connected subgraph that is (potentially) the same in two trees always contains the root and spans downwards. In some cases it can even reach the leaves. See an example (the yellow part is the connected subgraph including the root):

Given these rules, mathematically speaking all the trees are similar: the connected subgraph is either empty, or contains only the root, or, inductively, contains the root and its children...

Answer 1

You can group the children of each tree node by "owner". When you add a node, you specify its owner (or null to use the default "shared" owner). When you traverse the tree, you also specify the owner. Here is a code sketch:

using System;
using System.Collections.Generic;

// Non-generic base class so that every TreeNode<T> uses the same "shared owner" key.
class TreeNode {
    protected static readonly object SharedOwner = new object();
}

class TreeNode<T> : TreeNode {
    private readonly T _data;
    // Children grouped by owner; the SharedOwner group is visible to every tree.
    private readonly Dictionary<object, List<TreeNode<T>>> _children;

    public TreeNode(T data) {
        _data = data;
        _children = new Dictionary<object, List<TreeNode<T>>>();
    }

    // Adds a child for the given owner; a null owner means the child is shared by all trees.
    public TreeNode<T> AddChild(T data, object owner = null) {
        if (owner == null)
            owner = SharedOwner;
        List<TreeNode<T>> siblings;
        if (!_children.TryGetValue(owner, out siblings)) {
            siblings = new List<TreeNode<T>>();
            _children.Add(owner, siblings);
        }
        var added = new TreeNode<T>(data);
        siblings.Add(added);
        return added;
    }

    // Visits the shared nodes plus the nodes that belong to the given owner, depth-first.
    public void Traverse(Action<T> visitor, object owner = null) {
        TraverseRecursive(this, visitor, owner);
    }

    private static void TraverseRecursive(TreeNode<T> node, Action<T> visitor, object owner) {
        visitor(node._data);
        List<TreeNode<T>> children;
        // first traverse "shared" owner's nodes
        if (node._children.TryGetValue(SharedOwner, out children)) {
            foreach (var sharedNode in children) {
                TraverseRecursive(sharedNode, visitor, owner);
            }
        }
        // then real owner's nodes
        if (owner != null && owner != SharedOwner && node._children.TryGetValue(owner, out children)) {
            foreach (var localNode in children) {
                TraverseRecursive(localNode, visitor, owner);
            }
        }
    }
}

And a sample usage:

class Program {
    static void Main(string[] args) {
        // this is shared part
        var shared = new TreeNode<string>("1");
        var leaf1 = shared.AddChild("1.1").AddChild("1.1.1");
        var leaf2 = shared.AddChild("1.2").AddChild("1.2.1");
        var firstOwner = new object();
        var secondOwner = new object();
        // here we branch first time
        leaf1.AddChild("1.1.1.1", firstOwner);
        leaf2.AddChild("1.2.1.1", firstOwner);
        // and here another branch
        leaf1.AddChild("1.1.1.2", secondOwner);
        leaf2.AddChild("1.2.1.2", secondOwner);
        shared.Traverse(Console.WriteLine, firstOwner);
        shared.Traverse(Console.WriteLine, secondOwner);
        Console.ReadKey();
    }        
}

Answer 2

The problem with "reusing" a part of a tree with different leaves is that you need to provide additional information about how to map the leaves of the common part to the different graphs. Since your search can end up in any node within the common part, you need to map each node in this common subtree to the "actual" nodes inside each graph.

For example, these two "similar" trees A and B share a common part of a subtree (nodes 1, 3, 6, 7, 8):

To reuse the "common part", you would do something like:

Does this provide any space savings? Well, if knowing A and 3 means you can directly "calculate" A3 without a need for a lookup, then in this particular example, you wouldn't need to map "inner" common nodes 3 and 6 for any of the graphs, saving a bit of space.

In other words, if these common subtrees don't only share their structure, but also their content, then you only need to map exit nodes (leaves) to separate graph nodes.
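As a minimal sketch of the "map only the exit nodes" idea, assuming the common part is identical in both structure and content for every tree in a group: the shared nodes are stored once, and each tree keeps only a small table from exit nodes to its own private subtrees. All names here (Node, OverlayTree, Attach, Find) are hypothetical, and the string-keyed children are an assumption about your data rather than something taken from the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type: children are looked up by a string key (assumption).
class Node
{
    public readonly string Value;
    public readonly Dictionary<string, Node> Children = new Dictionary<string, Node>();
    public Node(string value) { Value = value; }
}

// One instance per logical tree. The shared core is referenced, never copied.
class OverlayTree
{
    private readonly Node _sharedRoot;
    // Exit node of the shared core -> children that exist only in this tree.
    private readonly Dictionary<Node, Dictionary<string, Node>> _exits =
        new Dictionary<Node, Dictionary<string, Node>>();

    public OverlayTree(Node sharedRoot) { _sharedRoot = sharedRoot; }

    // Hangs a private subtree under a node of the shared core, for this tree only.
    public void Attach(Node exitNode, string key, Node privateSubtree)
    {
        Dictionary<string, Node> extra;
        if (!_exits.TryGetValue(exitNode, out extra))
        {
            extra = new Dictionary<string, Node>();
            _exits.Add(exitNode, extra);
        }
        extra[key] = privateSubtree;
    }

    // Follows an exact root-to-node path; returns null if the path does not exist.
    public Node Find(params string[] path)
    {
        Node current = _sharedRoot;
        foreach (var key in path)
        {
            Node next;
            Dictionary<string, Node> extra;
            if (current.Children.TryGetValue(key, out next))
                current = next;                     // ordinary child (shared or private)
            else if (_exits.TryGetValue(current, out extra) && extra.TryGetValue(key, out next))
                current = next;                     // leaving the shared core
            else
                return null;
        }
        return current;
    }
}
```

The per-tree cost is then proportional to the number of exit nodes plus the private subtrees, not to the size of the shared core. Whether that is a net saving depends on how large the core is relative to the lookup tables.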

(Update)

For completeness' sake, I've added a diagram of @Evk's implementation, which stores lookup tables inside the actual nodes. Space-wise, it shouldn't be different, but since you have a working example in that answer, it might be useful to visualize it:

Since you know the details of the actual data you're dealing with, you might be able to squeeze a bit of space here and there, but my recommendation would still be to either:

Add more RAM to the machine, or
Use a disk-based tree, potentially a b-tree, even better if using an SSD.

Answer 3

If I understand your problem, part of the solution is to have the roots of the subtrees shared by several trees, and information in the leaves that tells which tree a leaf belongs to. The way to arrange this information depends on the kind of queries you need to perform.

With the new explanation, I understand that you need to represent the maximal tree and enhance the nodes with a "stop list" that indicates which of the partial trees stop at this node, i.e. don't share any further descendants.

Once again, the appropriate data structure for the stop list depends on the access pattern.
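A minimal sketch of the stop-list idea, under the assumption that a tree "stopping" at a node means none of that node's children belong to it, and that trees are identified by an integer id. The names (MaxNode, StopList, MaximalTree.Find) are hypothetical, and a HashSet is only one possible choice for the stop list.

```csharp
using System;
using System.Collections.Generic;

// Node of the maximal (union) tree; hypothetical type with string-keyed children.
class MaxNode
{
    public readonly string Value;
    public readonly Dictionary<string, MaxNode> Children = new Dictionary<string, MaxNode>();
    // Ids of the partial trees that end at this node and own none of its children.
    public readonly HashSet<int> StopList = new HashSet<int>();
    public MaxNode(string value) { Value = value; }
}

static class MaximalTree
{
    // Follows an exact path on behalf of one partial tree.
    // Returns null if the path is missing or the tree stops before the path ends.
    public static MaxNode Find(MaxNode root, int treeId, params string[] path)
    {
        MaxNode current = root;
        foreach (var key in path)
        {
            if (current.StopList.Contains(treeId))
                return null;                        // this tree does not go any deeper
            MaxNode next;
            if (!current.Children.TryGetValue(key, out next))
                return null;                        // no such child in the maximal tree
            current = next;
        }
        return current;
    }
}
```

With root-to-leaf reads only, a hash set gives a constant-time membership check per step. If most trees stop at most nodes, storing the trees that continue instead of the ones that stop may be smaller.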

It is highly probable that this representation will be less compact than the simple forest of trees.

Answer 4

Have you tried AVL trees (self-balancing binary search trees) yet? If not, this data structure is efficient in such situations.

4 answers

solution1
3 ACCEPTED 2016-05-06 08:59:22

solution2
1 2016-05-06 08:53:36

solution3
0 2016-05-06 07:16:37

solution4
-3 2016-05-06 06:27:09
