Grouping data by whether or not the data's feature intersects
Suppose we have a table:
id | aliases
-------------
0 | ['a0', 'a1', 'a4', 'a11']
1 | ['a3', 'a5']
2 | ['a16', 'a18']
3 | ['a6', 'a8', 'a10']
4 | ['a7', 'a8', 'a9']
5 | ['a3', 'a12', 'a14']
6 | ['a5', 'a16', 'a17']
and I'd like to group together all id's that map to the same aliases; in other words, the end result groups together all id's whose aliases intersect, applied recursively. In the above case, we would have:
0 maps to ['a0', 'a1', 'a4', 'a11']
1, 2, 5, and 6 map to ['a3', 'a5', 'a12', 'a14', 'a16', 'a17', 'a18']
3 and 4 map to ['a6', 'a7', 'a8', 'a9', 'a10']
Is there an efficient way to do this? In my actual use case I have around 15M rows.
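For concreteness, here is the example table above written out as plain Python lists (the variable names ids and aliases are just illustrative):

```python
# The example table, one row per id.
ids = [0, 1, 2, 3, 4, 5, 6]
aliases = [
    ['a0', 'a1', 'a4', 'a11'],
    ['a3', 'a5'],
    ['a16', 'a18'],
    ['a6', 'a8', 'a10'],
    ['a7', 'a8', 'a9'],
    ['a3', 'a12', 'a14'],
    ['a5', 'a16', 'a17'],
]
```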
There is a naïve approach of streaming the rows and checking whether each element of aliases in each new row appears in the aliases processed so far; if so, collecting together the id's of all rows with matching aliases, and mapping them to the union of the aliases matched.
However, this approach seems computationally impractical.
Running code with O(n * groupcount) complexity on this table shouldn't be that hard; just off the top of my head: assume you have id as a list of id values and aliases as a list of lists, then you can do:
bins = []  # groups of row ids
sets = []  # alias set for each group
for i in id:  # assume ids run from 0 to n - 1
    alias = set(aliases[i])
    # find every existing group whose alias set intersects this row's
    hits = [j for j in range(len(sets)) if sets[j] & alias]
    if not hits:
        bins.append([i])
        sets.append(alias)
    else:
        j = hits[0]
        sets[j] |= alias
        bins[j].append(i)
        # also merge any further intersecting groups, so that overlaps
        # are applied recursively rather than only against the first match
        for k in reversed(hits[1:]):
            sets[j] |= sets.pop(k)
            bins[j] += bins.pop(k)
bins will contain the id groups, and the corresponding element in sets will contain the alias groups; you can use list() to convert these sets back to lists. And since all set operations are hash based, this ensures your program runs in O(n * groupcount) time.
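At ~15M rows, re-scanning every existing group for each row can still degrade badly when there are many groups. As an alternative sketch (not the answer above, and group_ids is a hypothetical helper written for illustration), a disjoint-set / union-find keyed on the alias values themselves merges intersecting groups in near-linear time, and handles the recursive merging automatically:

```python
def group_ids(ids, aliases):
    """Group ids whose alias lists intersect, transitively,
    via a disjoint-set (union-find) over alias values."""
    parent = {}

    def find(x):
        # follow parent pointers to the root, with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # all aliases appearing in the same row belong to one component
    for alias in aliases:
        for a in alias:
            parent.setdefault(a, a)
        for a, b in zip(alias, alias[1:]):
            union(a, b)

    # collect rows by the root of (any of) their aliases
    components = {}
    for i, alias in zip(ids, aliases):
        root = find(alias[0])
        id_group, alias_group = components.setdefault(root, ([], set()))
        id_group.append(i)
        alias_group.update(alias)
    return [(g, sorted(s)) for g, s in components.values()]

ids = [0, 1, 2, 3, 4, 5, 6]
aliases = [
    ['a0', 'a1', 'a4', 'a11'],
    ['a3', 'a5'],
    ['a16', 'a18'],
    ['a6', 'a8', 'a10'],
    ['a7', 'a8', 'a9'],
    ['a3', 'a12', 'a14'],
    ['a5', 'a16', 'a17'],
]
groups = group_ids(ids, aliases)
```

The key difference from the scan-all-groups loop is that each row only touches its own aliases, so a row that bridges two previously separate groups (like id 6 bridging the groups of ids 1 and 2) is merged through the shared union-find roots rather than by an explicit search.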