
Grouping data by whether or not the data's feature intersects

Suppose we have a table:

id  | aliases
-------------
0   | ['a0', 'a1', 'a4', 'a11']
1   | ['a3', 'a5']
2   | ['a16', 'a18']
3   | ['a6', 'a8', 'a10']
4   | ['a7', 'a8', 'a9']
5   | ['a3', 'a12', 'a14']
6   | ['a5', 'a16', 'a17']

and I'd like to group all id's together that map to the same aliases; in other words, the end result groups together all id's whose aliases intersect, applied recursively. In the above case, we would have:

  • 0 maps to ['a0', 'a1', 'a4', 'a11']
  • 1, 2, 5, and 6 map to ['a3', 'a5', 'a12', 'a14', 'a16', 'a17', 'a18']
  • 3 and 4 map to ['a6', 'a7', 'a8', 'a9', 'a10']

Is there an efficient way to do this? In my actual use case I have around 15M rows.
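For reference, the sample table can be written as two plain Python lists (the names ids and aliases are illustrative, and are also what the answer below assumes):

ids = [0, 1, 2, 3, 4, 5, 6]
aliases = [
    ['a0', 'a1', 'a4', 'a11'],
    ['a3', 'a5'],
    ['a16', 'a18'],
    ['a6', 'a8', 'a10'],
    ['a7', 'a8', 'a9'],
    ['a3', 'a12', 'a14'],
    ['a5', 'a16', 'a17'],
]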

There is a naïve approach of streaming the rows and checking whether any element of aliases in each new row appears among the aliases processed so far; if so, collecting together the id's of all rows with matching aliases and mapping them to the union of the matched aliases.

However, this approach seems computationally impractical.

Running O(n * group_count) complexity code on this table shouldn't be that hard; just off the top of my head:

Assuming you have ids as a list of id values and aliases as a list of lists, you can do:

bins = []   # groups of ids
sets = []   # corresponding alias groups, kept as Python sets
for i in ids:  # assume ids run from 0 to n-1
    alias = set(aliases[i])
    # find every existing group whose aliases intersect this row's aliases
    matches = [j for j in range(len(sets)) if sets[j] & alias]
    if not matches:
        # no overlap with anything seen so far: start a new group
        bins.append([i])
        sets.append(alias)
    else:
        # this row may connect several existing groups; merge them all into
        # the first matching group, then fold in the row itself
        j = matches[0]
        bins[j].append(i)
        sets[j] |= alias
        for k in reversed(matches[1:]):
            bins[j].extend(bins.pop(k))
            sets[j] |= sets.pop(k)

bins will contain the id groups, and the corresponding elements of sets will contain the alias groups; you can use list() to convert those sets back to lists. Merging every matching group when a row touches more than one of them is what makes the grouping transitive (the "applied recursively" part of the question). And since all the set operations are hash based, the program runs in roughly O(n * group_count) time.
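For example, assuming the ids and aliases lists shown earlier and the loop above, printing the groups gives output along these lines (the ordering inside each group depends on processing order):

for id_group, alias_group in zip(bins, sets):
    print(id_group, sorted(alias_group))

# [0] ['a0', 'a1', 'a11', 'a4']
# [1, 5, 6, 2] ['a12', 'a14', 'a16', 'a17', 'a18', 'a3', 'a5']
# [3, 4] ['a10', 'a6', 'a7', 'a8', 'a9']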
