
High performance many-to-many relationships in python

Given cluster and node objects:

class Cluster:
    def __init__(self):
        pass


class Node:
    def __init__(self):
        pass

I am wondering what is the best data structure or design that meets the following requirements:

  1. Find all the clusters that a given node belongs to.
  2. Find all the nodes that belong to a given cluster.
  3. Keep track of a numerical quantity that represents how much each node belongs to a cluster, and each cluster to a node.
  4. Ensure consistency when a node or cluster is deleted or added.
  5. Fast lookups, additions and deletes (in that order).
  6. Low memory requirements.

The number of nodes and clusters will each be in the range of 100,000.

More details of varying relevance:

  • A node will always belong to one or more clusters.
  • A cluster will always contain one or more nodes.
  • If a cluster has its only node removed, the cluster should be deleted.
  • A node will never have all of its clusters removed.
  • Example: node1 might belong 90% to cluster14 and 10% to cluster88 (a sketch of one such weighted structure follows below).
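For concreteness, here is a minimal sketch of the naive baseline I have in mind: two mirrored dicts of weights, one per direction. The class and method names are just illustrative:

from collections import defaultdict

class Membership:
    """Weighted many-to-many links between nodes and clusters,
    stored redundantly in both directions for O(1) lookups."""

    def __init__(self):
        self.clusters_of = defaultdict(dict)  # node -> {cluster: weight}
        self.nodes_of = defaultdict(dict)     # cluster -> {node: weight}

    def link(self, node, cluster, weight):
        self.clusters_of[node][cluster] = weight
        self.nodes_of[cluster][node] = weight

    def unlink(self, node, cluster):
        del self.clusters_of[node][cluster]
        del self.nodes_of[cluster][node]
        if not self.nodes_of[cluster]:  # a cluster whose last node is removed is deleted
            del self.nodes_of[cluster]

m = Membership()
m.link('node1', 'cluster14', 0.9)
m.link('node1', 'cluster88', 0.1)
print(m.clusters_of['node1'])  # {'cluster14': 0.9, 'cluster88': 0.1}

The redundancy makes both directions fast but doubles the memory for the links, which is exactly the trade-off I am unsure about at this scale.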

I was thinking about using SQLite, but the problem is that storing serialized objects in the database is too slow. I could store object_ids in the database and then look those up in a dict that maps object_ids to object instances, but then there are consistency issues between the dict and the database. Additionally, fetching a list of instances from the dict is a bit cumbersome.
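To illustrate, a minimal sketch of that object_id approach, assuming an in-memory sqlite3 table of links and a plain dict as the id-to-instance registry (the table and variable names are hypothetical):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE membership (node_id INTEGER, cluster_id INTEGER, weight REAL)')

registry = {}  # object_id -> live instance; must be kept in sync with the table by hand

def nodes_in_cluster(cluster_id):
    rows = conn.execute(
        'SELECT node_id FROM membership WHERE cluster_id = ?', (cluster_id,))
    return [registry[nid] for (nid,) in rows]  # the cumbersome two-step lookup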

I could possibly store the memory locations of the instances in SQLite, but that seems dangerous, and we still have consistency issues.

I implemented a similar data structure on a home project; my own requirements called for a look-alike architecture, except I called clusters "tags" (the core concept is the same).

Here is how you may implement it:

  • A list of clusters' names (or classes).
  • A dictionary of lists. In this dictionary, keys are bitmasks marking the fact that a node belongs to a given set of clusters, and values are all the corresponding nodes. Say, if you have clusters 1 to 4 and Node42 belongs to clusters 1 and 3, the dictionary will have an entry looking like 5:[Node42, ...] (see the sketch after this list).
  • A dictionary of singletons (an optional memory optimisation: a set weighs around 130 bytes in Python if I remember correctly, so having a dict that addresses singletons directly helps reduce memory consumption).
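A minimal sketch of that bitmask layout, reusing the Node42 example (the names are illustrative, not the actual project code):

# Cluster k (1-based) is mapped to bit (1 << (k - 1)).
cluster_names = ['cluster1', 'cluster2', 'cluster3', 'cluster4']

# mask -> nodes that belong to exactly that set of clusters.
# Node42 belongs to clusters 1 and 3, so its mask is 0b101 == 5.
nodes_by_mask = {
    0b101: ['Node42'],
    0b010: ['Node7', 'Node13'],
}

def nodes_in_cluster(k):
    """O(n) scan over the masks, aggregating matching nodes."""
    bit = 1 << (k - 1)
    return [node
            for mask, nodes in nodes_by_mask.items()
            if mask & bit
            for node in nodes]

print(nodes_in_cluster(3))  # ['Node42']
print(nodes_in_cluster(2))  # ['Node7', 'Node13']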

About the requirements:

  1. Depends; it is O(n) in my initial architecture, but at the cost of additional memory you can have this instantly: add a field to each node holding its corresponding key in the dictionary, then you just have to do the cluster lookups from that mask (see the sketch after this list).
  2. O(n): you have to read the data structure and aggregate the nodes belonging to a given cluster. Best cases are fast, but a heavily fragmented structure will be slow. For the same price, though, you can implement lookups of unions and intersections of clusters.
  3. For clusters: iterate through the dict and sum the lengths of the matching lists. For nodes: iterate through the node's mask and count the 1 bits.
  4. This is the hardest part and requires some programming; we are probably beyond the scope of Stack Overflow here, as we are talking about a hundred lines of Python or so.
  5. If you want to speed up the lookups, you need to accept redundancy. If an O(n) node lookup isn't acceptable, you can architecture it differently, starting with a node list in each cluster. However, if your overlap is big, so will be the memory requirements.
  6. We're in Python, so memory requirements are heavy. However, you can externalize the big dictionaries or lists to a Redis server. This would be my option for keeping lookups fast, since we are talking about an efficient in-memory store.
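To illustrate point 1, a minimal sketch assuming each node carries a redundant, hypothetical mask attribute mirroring its key in the dictionary (it keeps the same 1-based bit convention as the sketch above):

class Node:
    def __init__(self, name):
        self.name = name
        self.mask = 0  # redundant copy of this node's key in nodes_by_mask

def clusters_of(node, cluster_names):
    """Requirement 1 in O(number of clusters): just decode the node's mask."""
    return [name
            for k, name in enumerate(cluster_names, start=1)
            if node.mask & (1 << (k - 1))]

node42 = Node('Node42')
node42.mask = 0b101  # clusters 1 and 3
print(clusters_of(node42, ['cluster1', 'cluster2', 'cluster3', 'cluster4']))
# ['cluster1', 'cluster3']

The price is that every add or delete now has to update both the dictionary key and the per-node field, which is where the consistency code from point 4 comes in.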

If you are interested in the code I can release it for you to have a look, but I think you first need to make a choice or two regarding architecture: you can't have a pure-Python, fully constant-time, memory-efficient, large-scale data structure, IMHO.
