High performance many-to-many relationships in python

Question

Given cluster and node objects:

class Cluster:
    def __init__(self):
        pass


class Node:
    def __init__(self):
        pass

I am wondering what the best data structure or design would be to meet the following requirements:

Find all the clusters that a given node belongs to.
Find all the nodes that belong to a given cluster.
Keep track of a numerical quantity that represents how much each node belongs to a cluster, and each cluster to a node.
Ensure consistency when a node or cluster is deleted or added.
Fast lookups, additions and deletes. (in that order)
Low memory requirements.

The number of nodes and clusters will each be in the range of 100,000.

More details of varying relevance:

A node will always belong to one or more clusters.
A cluster will always contain one or more nodes.
If a cluster has its only node removed the cluster should be deleted.
A node will never have all of its clusters removed.
Example: node1 might belong 90% to cluster14 and 10% to cluster88

I was thinking about using SQLite, but the problem is that storing serialized objects in the database is too slow. I could store object_ids in the database and then look those up in a dict that maps object_ids to object instances, but then there are consistency issues between the dict and the database. Additionally, fetching a list of instances from the dict is a bit cumbersome.

I could possibly store the memory locations of the instances in SQLite, but that seems dangerous, and we would still have consistency issues.
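For concreteness, the id-based variant described above might look roughly like the sketch below. The table name `membership`, its columns, the `nodes` dict and the helper `nodes_in_cluster` are all made up for illustration. The sketch also shows where the consistency worry comes from: nothing ties the rows in the table to the entries in the dict.

```python
import sqlite3

# Hypothetical schema: one row per (node, cluster) link, with its weight.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE membership ("
    " node_id INTEGER, cluster_id INTEGER, weight REAL,"
    " PRIMARY KEY (node_id, cluster_id))"
)
# The primary key already covers lookups by node; this index covers lookups by cluster.
conn.execute("CREATE INDEX by_cluster ON membership (cluster_id)")
conn.executemany(
    "INSERT INTO membership VALUES (?, ?, ?)",
    [(1, 14, 0.9), (1, 88, 0.1), (2, 14, 1.0)],
)


class Node:
    def __init__(self, node_id):
        self.node_id = node_id


# Separate dict mapping ids to live instances; it must be kept in sync by hand.
nodes = {1: Node(1), 2: Node(2)}


def nodes_in_cluster(cluster_id):
    """Return (Node instance, weight) pairs for one cluster."""
    rows = conn.execute(
        "SELECT node_id, weight FROM membership"
        " WHERE cluster_id = ? ORDER BY node_id",
        (cluster_id,),
    )
    return [(nodes[node_id], weight) for node_id, weight in rows]
```

Deleting a node here means touching both the table and the dict, which is exactly the two-step update I would like to avoid.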

Answer 1

I implemented a similar data structure for a home project; my own requirements called for a look-alike architecture, except that I called the clusters "tags" (the core concept is the same).

Here is how you may implement it:

A list of the clusters' names (or classes).
A dictionary of lists. In this dictionary, the keys are bitmasks marking membership in a given set of clusters, and the values are all the corresponding nodes. Say you have clusters 1 to 4 and Node42 belongs to clusters 1 and 3: the dictionary will have an entry looking like 5:[Node42, ...] (binary 0101, with cluster 1 on the lowest bit).
A dictionary of singletons. This is an optional memory optimisation: a set weighs around 130 bytes in Python, if I remember correctly, so a dict that addresses singletons directly helps reduce memory consumption.
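A minimal sketch of the first two pieces, to make the bitmask idea concrete. This is not my project's code: the names `cluster_names`, `bit_of`, `nodes_by_mask`, `add_node`, `nodes_in_cluster` and `count_clusters` are invented for the example, nodes are plain strings, and the singleton dict and the weights are left out.

```python
from collections import defaultdict

# Hypothetical cluster names; the cluster at list index i owns bit i of the mask.
cluster_names = ["cluster1", "cluster2", "cluster3", "cluster4"]
bit_of = {name: 1 << i for i, name in enumerate(cluster_names)}

# mask -> list of nodes that belong to exactly that set of clusters
nodes_by_mask = defaultdict(list)


def add_node(node, clusters):
    """File the node under the mask built from its clusters; return the mask."""
    mask = 0
    for name in clusters:
        mask |= bit_of[name]
    nodes_by_mask[mask].append(node)
    return mask


def nodes_in_cluster(name):
    """O(number of distinct masks): scan every mask that has the cluster's bit set."""
    bit = bit_of[name]
    found = []
    for mask, nodes in nodes_by_mask.items():
        if mask & bit:
            found.extend(nodes)
    return found


def count_clusters(mask):
    """Number of clusters encoded in a mask, i.e. the number of 1 bits."""
    return bin(mask).count("1")


mask42 = add_node("Node42", ["cluster1", "cluster3"])  # bits 0 and 2 -> mask 5
add_node("Node7", ["cluster3"])
```

Python integers have arbitrary precision, so a mask can carry one bit per cluster even with a large number of clusters, at the cost of bigger keys.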

About the requirements, in the same order:

It depends. It is O(n) in my initial architecture, but at the cost of additional memory you can get this instantly: add a field to each node holding its key in the dictionary, then you just have to do the cluster lookups with the mask.
O(n): you have to read through the data structure and aggregate the nodes belonging to a given cluster. The best cases are fast, but a heavily fragmented structure will be slow. For the same price, though, you can implement lookups of unions and intersections of clusters.
For clusters: iterate through the dicts and sum the lens. For nodes: iterate through the mask and count the 1s.
This is the hardest part and requires some programming. It is probably beyond the scope of Stack Overflow, as we are talking about a hundred lines of Python or so.
If you want to speed up the lookups, you need to accept redundancy. If an O(n) node lookup isn't acceptable, you can architect it differently, starting with a node list in each cluster. However, if the overlap is large, the memory requirements will be too.
We're in Python, so memory requirements are heavy. However, you can externalize big dictionaries or lists to a Redis server. This would be my option for keeping lookups fast, since Redis is an efficient in-memory store.
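To illustrate the redundancy trade-off, here is a rough sketch of the alternative layout with a node collection in each cluster and a cluster collection for each node. It is a different design from the bitmask dictionary above, not my original code. The class `Membership` and its methods are hypothetical names, and the rule that an emptied cluster disappears is taken from the question.

```python
from collections import defaultdict


class Membership:
    """Hypothetical two-way index: every link is stored in both directions."""

    def __init__(self):
        self.clusters_of = defaultdict(dict)  # node -> {cluster: weight}
        self.nodes_of = defaultdict(dict)     # cluster -> {node: weight}

    def link(self, node, cluster, weight):
        # Write both directions together so the two dicts never disagree.
        self.clusters_of[node][cluster] = weight
        self.nodes_of[cluster][node] = weight

    def unlink(self, node, cluster):
        self.clusters_of[node].pop(cluster, None)
        self.nodes_of[cluster].pop(node, None)
        # Drop entries that became empty, e.g. a cluster that lost its only node.
        if not self.clusters_of[node]:
            del self.clusters_of[node]
        if not self.nodes_of[cluster]:
            del self.nodes_of[cluster]

    def remove_node(self, node):
        # Copy the keys first because unlink mutates the dict being walked.
        for cluster in list(self.clusters_of.get(node, {})):
            self.unlink(node, cluster)


m = Membership()
m.link("node1", "cluster14", 0.9)
m.link("node1", "cluster88", 0.1)
m.link("node2", "cluster14", 1.0)
m.remove_node("node1")  # cluster88 had only node1, so it goes away too
```

Both lookups become a single dict access, but each link is stored twice, which is the memory cost mentioned above.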

If you are interested in the code, I can release it for you to have a look, but I think you first need to make a choice or two regarding architecture: you can't have a pure-Python, fully constant-time, memory-efficient, large-scale data structure, IMHO.

solution1
0 2016-06-02 15:10:21
