
Grouping data by whether or not the data's feature intersects

Suppose we have a table:

id  | aliases
-------------
0   | ['a0', 'a1', 'a4', 'a11']
1   | ['a3', 'a5']
2   | ['a16', 'a18']
3   | ['a6', 'a8', 'a10']
4   | ['a7', 'a8', 'a9']
5   | ['a3', 'a12', 'a14']
6   | ['a5', 'a16', 'a17']

and I'd like to group all id's together that map to the same aliases; in other words, the end result groups together all id's whose aliases intersect, applied recursively. In the above case, we would have:

  • 0 maps to ['a0', 'a1', 'a4', 'a11']
  • 1, 2, 5, and 6 map to ['a3', 'a5', 'a12', 'a14', 'a16', 'a17', 'a18']
  • 3 and 4 map to ['a6', 'a7', 'a8', 'a9', 'a10']

Is there an efficient way to do this? In my actual use case I have around 15M rows.
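For reference, the sample table can be written as two plain Python lists (the names ids and aliases are illustrative, and are also what the answer below assumes):

ids = [0, 1, 2, 3, 4, 5, 6]
aliases = [
    ['a0', 'a1', 'a4', 'a11'],
    ['a3', 'a5'],
    ['a16', 'a18'],
    ['a6', 'a8', 'a10'],
    ['a7', 'a8', 'a9'],
    ['a3', 'a12', 'a14'],
    ['a5', 'a16', 'a17'],
]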

There is a naïve approach of streaming the rows and checking whether any element of aliases in each new row appears among the aliases processed so far; if so, collecting together the id's of all rows with matching aliases and mapping them to the union of the matched aliases.

However, this approach seems computationally impractical.

Running O(n * group_count) complexity code on this table shouldn't be that hard; just off the top of my head:

Assuming you have ids as a list of id values and aliases as a list of lists, you can do:

bins = []   # groups of ids
sets = []   # corresponding alias groups, kept as Python sets
for i in ids:  # assume ids run from 0 to n-1
    alias = set(aliases[i])
    # find every existing group whose aliases intersect this row's aliases
    matches = [j for j in range(len(sets)) if sets[j] & alias]
    if not matches:
        # no overlap with anything seen so far: start a new group
        bins.append([i])
        sets.append(alias)
    else:
        # this row may connect several existing groups; merge them all into
        # the first matching group, then fold in the row itself
        j = matches[0]
        bins[j].append(i)
        sets[j] |= alias
        for k in reversed(matches[1:]):
            bins[j].extend(bins.pop(k))
            sets[j] |= sets.pop(k)

bins will contain the id groups, and the corresponding elements of sets will contain the alias groups; you can use list() to convert those sets back to lists. Merging every matching group when a row touches more than one of them is what makes the grouping transitive (the "applied recursively" part of the question). And since all the set operations are hash based, the program runs in roughly O(n * group_count) time.
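For example, assuming the ids and aliases lists shown earlier and the loop above, printing the groups gives output along these lines (the ordering inside each group depends on processing order):

for id_group, alias_group in zip(bins, sets):
    print(id_group, sorted(alias_group))

# [0] ['a0', 'a1', 'a11', 'a4']
# [1, 5, 6, 2] ['a12', 'a14', 'a16', 'a17', 'a18', 'a3', 'a5']
# [3, 4] ['a10', 'a6', 'a7', 'a8', 'a9']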
