
Find most similar sets efficiently (Python, data structures)

Suppose I have a couple thousand Python sets in a list called my_sets. For every set "A" in my_sets, I want to find the five sets (sets "B") in my_sets that contain the highest percentage of set A's members.

I'm currently storing the data as sets and looping over them twice to calculate overlap...

from random import randint
from heapq import heappush, heappop

my_sets = []

for i in range(20):
    new_set = set()

    for j in range(20):
        new_set.add(randint(0, 50))

    my_sets.append(new_set)

for i in range(len(my_sets)):
    neighbor_heap = []

    for j in range(len(my_sets)):
        if i == j:
            continue

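        # 1/overlap makes the min-heap yield the largest overlaps first,
        # but it raises ZeroDivisionError if the two sets are disjoint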
        heappush(neighbor_heap, (1 / len(my_sets[i] & my_sets[j]), j))

    results = []

    while len(results) < 5:
        results.append(heappop(neighbor_heap)[1])

    print('Closest neighbors to set {} are sets {}'.format(i, results))

However, this is obviously an O(N**2) algorithm, so it blows up when my_sets gets long. Is there a better data structure or algorithm that can be implemented in base Python for tackling this? There is no reason that my_sets has to be a list, or that each individual set actually has to be a Python set. Any way of storing whether or not each set contains members from a finite list of options would be fine (e.g., a bunch of bools or bits in a standardized order). And building a more exotic data structure to save time would also be fine.
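For concreteness, here is a minimal sketch (illustrative only, not what I'm actually running) of that bit-based idea, with the elements kept in range(51) as above: each set becomes a single Python int used as a bitmask, and intersection size is the popcount of the bitwise AND. Note that int.bit_count() needs Python 3.10+; bin(x).count('1') works on older versions. The representation is more compact, but the algorithm is still O(N**2):

from random import randint

my_masks = []

for _ in range(20):
    mask = 0
    for _ in range(20):
        mask |= 1 << randint(0, 50)  # set the bit for this element
    my_masks.append(mask)

def overlap(a, b):
    return (a & b).bit_count()  # popcount of the AND = size of the intersection

for i in range(len(my_masks)):
    best = sorted((j for j in range(len(my_masks)) if j != i),
                  key=lambda j: overlap(my_masks[i], my_masks[j]),
                  reverse=True)[:5]
    print('Closest neighbors to set {} are sets {}'.format(i, best))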

(As some people will likely want to point out, I could, of course, structure this as a Numpy array where rows are sets and columns are elements and cells are a 1/0, depending on whether that element is in that set. Then, you'd just do some Numpy operations. This would undoubtedly be faster, but I haven't really improved my algorithm at all, I've just offloaded complexity to somebody else's optimized C/Fortran/whatever.)
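(A minimal sketch of that Numpy formulation, assuming numpy is available -- a single matrix product computes every pairwise overlap at once:)

import numpy as np

n_sets, n_values = 20, 51
rng = np.random.default_rng()

# rows are sets, columns are elements, cells are 1/0 membership flags
M = np.zeros((n_sets, n_values), dtype=np.int32)
for i in range(n_sets):
    M[i, rng.integers(0, n_values, size=20)] = 1

overlaps = M @ M.T                 # overlaps[i, j] == |set i & set j|
np.fill_diagonal(overlaps, -1)     # exclude each set from its own ranking
top5 = np.argsort(-overlaps, axis=1)[:, :5]  # indexes of the 5 best per row
print(top5)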

EDIT: After a full test, the algorithm I originally posted runs in ~586 seconds under the agreed test conditions.

Could you:

  1. invert the sets, to produce, for each set element, a list (or set) of the sets which contain it,

    which is O(n * m) -- for n sets and on average m elements per set.

  2. for each set S, consider its elements, and (using 1) construct a list (or heap) of other sets and how many elements each one shares with S -- pick the 'best' 5.

    which is O(n * m * a), where a is the average number of sets each element is a member of.

How far removed from O(n * n) that is obviously depends on m and a.
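As a rough sanity check (my own estimate, not measured): with the parameters used below -- n = 10000 sets of m ~ 200 elements drawn from 1000 possible values -- each value appears in roughly a ~ n * m / 1000 ~ 2000 sets, so step 2 performs on the order of n * m * a ~ 4 * 10**9 counter updates, against roughly n * n * m = 2 * 10**10 element probes for the naive pairwise intersections: about a 5x reduction, broadly in line with the timings measured further down.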

Edit: Naive implementation in Python runs in 103 seconds on my machine...

from collections import Counter
from random import randint
from time import perf_counter  # time.clock was removed in Python 3.8

old_time = perf_counter()

my_sets = []

for i in range(10000):
    new_set = set()

    for j in range(200):
        new_set.add(randint(0, 999))

    my_sets.append(new_set)

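# invert: my_inv_sets[v] lists the indexes of the sets that contain value v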
my_inv_sets = [[] for i in range(1000)]

for i in range(len(my_sets)):
    for j in range(1000):
        if j in my_sets[i]:
            my_inv_sets[j].append(i)

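# counter[k] ends up equal to len(my_sets[i] & my_sets[k]);
# most_common(6)[1:] drops the top entry, which is set i matching itself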
for i in range(len(my_sets)):
    counter = Counter()

    for j in my_sets[i]:
        counter.update(my_inv_sets[j])

    print(counter.most_common(6)[1:])

print(perf_counter() - old_time)

You could reduce the number of passes through the set list by building a list of set indexes associated with each value. Then, with one additional pass through the set list, you can determine which sets share the most values with a given set by tallying the indexes associated with each of its values.

This will improve performance in some cases, but depending on the density of the data it may not make a huge difference.

Here is an example using defaultdict and Counter from the collections module.

from collections import defaultdict,Counter
def top5Matches(setList):
    valueSets = defaultdict(list)
    for i,aSet in enumerate(setList):
        for v in aSet: valueSets[v].append(i)
    results = []
    for i,aSet in enumerate(setList):
        counts = Counter()
        for v in aSet: counts.update(valueSets[v])
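        # a set trivially matches itself, so zero out its own count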
        counts[i] = 0
        top5 = [setList[j] for j,_ in counts.most_common(5)]
        results.append((aSet,top5))
    return results

In order to compare execution times I took the liberty of embedding your solution in a function. I also had to make a fix for cases where two sets would have no intersection at all:

from heapq import heappush, heappop
def OPSolution(my_sets):
    results = []
    for i in range(len(my_sets)):
        neighbor_heap = []
        for j in range(len(my_sets)):
            if i == j: continue
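            # max(1, ...) avoids ZeroDivisionError when the sets are disjoint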
            heappush(neighbor_heap, (1 / max(1,len(my_sets[i] & my_sets[j])), j))
        top5 = []
        while len(top5) < 5:
            j = heappop(neighbor_heap)[1]
            top5.append(my_sets[j])
        results.append((my_sets[i],top5))    
    return results

Both functions return a list of tuples containing the original set and a list of the top 5 sets based on the number of common values.

The two functions produce the same results, although the chosen top 5 may differ when a 6th (or further) set ties on intersection count.

from random import randrange

my_sets = [ set(randrange(50) for _ in range(20)) for _ in range(20) ]
opResults = OPSolution(my_sets)
print("OPSolution: (matching counts)")
for i,(aSet,top5) in enumerate(opResults):
    print(i,"Top 5:",[len(aSet&otherSet) for otherSet in top5])
print("")

print("top5Matches: (matching counts)")
t5mResults = top5Matches(my_sets)
for i,(aSet,top5) in enumerate(t5mResults):
    print(i,"Top 5:",[len(aSet&otherSet) for otherSet in top5])
print("")

Output:

OPSolution: (matching counts)
0 Top 5: [8, 7, 7, 7, 6]
1 Top 5: [7, 6, 6, 6, 6]
2 Top 5: [8, 7, 6, 6, 6]
3 Top 5: [8, 7, 7, 6, 6]
4 Top 5: [9, 8, 8, 8, 8]
5 Top 5: [7, 6, 6, 6, 6]
6 Top 5: [8, 8, 8, 7, 6]
7 Top 5: [8, 8, 7, 7, 7]
8 Top 5: [9, 7, 7, 7, 6]
9 Top 5: [8, 8, 8, 7, 7]
10 Top 5: [8, 8, 7, 7, 7]
11 Top 5: [8, 8, 7, 7, 6]
12 Top 5: [8, 7, 7, 7, 7]
13 Top 5: [8, 8, 8, 6, 6]
14 Top 5: [9, 8, 8, 6, 6]
15 Top 5: [6, 6, 5, 5, 5]
16 Top 5: [9, 7, 7, 6, 6]
17 Top 5: [8, 7, 7, 7, 7]
18 Top 5: [8, 8, 7, 6, 6]
19 Top 5: [7, 6, 6, 6, 6]

top5Matches: (matching counts)
0 Top 5: [8, 7, 7, 7, 6]
1 Top 5: [7, 6, 6, 6, 6]
2 Top 5: [8, 7, 6, 6, 6]
3 Top 5: [8, 7, 7, 6, 6]
4 Top 5: [9, 8, 8, 8, 8]
5 Top 5: [7, 6, 6, 6, 6]
6 Top 5: [8, 8, 8, 7, 6]
7 Top 5: [8, 8, 7, 7, 7]
8 Top 5: [9, 7, 7, 7, 6]
9 Top 5: [8, 8, 8, 7, 7]
10 Top 5: [8, 8, 7, 7, 7]
11 Top 5: [8, 8, 7, 7, 6]
12 Top 5: [8, 7, 7, 7, 7]
13 Top 5: [8, 8, 8, 6, 6]
14 Top 5: [9, 8, 8, 6, 6]
15 Top 5: [6, 6, 5, 5, 5]
16 Top 5: [9, 7, 7, 6, 6]
17 Top 5: [8, 7, 7, 7, 7]
18 Top 5: [8, 8, 7, 6, 6]
19 Top 5: [7, 6, 6, 6, 6]

Comparing execution times for various combinations of settings shows that indexing by value performs better on larger data sets (albeit not by much in some cases):

[EDIT] Added Chris Hall's solution to measure the speed improvement provided by limiting the functionality to sets of values in a consecutive range. I also had to embed it in a function and test that the results were the same. While doing so, I realized that we essentially had the same approach. The main difference is that Chris uses a list instead of a dictionary, which constrains the values to a range() whose size must be provided.

def chrisHall(my_sets,valueRange):
    results = []
    my_inv_sets = [[] for i in range(valueRange)]
    for i in range(len(my_sets)):
        for j in range(valueRange):
            if j in my_sets[i]:
                my_inv_sets[j].append(i)

    for i in range(len(my_sets)):
        counter = Counter()

        for j in my_sets[i]:
            counter.update(my_inv_sets[j])

        top5 = [my_sets[j] for j,_ in counter.most_common(6)[1:]]
        results.append((my_sets[i],top5))
    return results

Performance tests were also embedded in a function to avoid repeating the boilerplate code:

from random import randrange
from timeit import timeit

def compareSolutions(title,setCount,setSize,valueRange,count=1):

    print("-------------------")
    print(title,setCount,"sets of",setSize,"elements in range 0 ...",valueRange)
    testSets = [ set(randrange(valueRange) for _ in range(setSize)) for _ in range(setCount) ]

    t = timeit(lambda: chrisHall(testSets,valueRange),number=count)
    print("chrisHall",t)

    t = timeit(lambda: top5Matches(testSets),number=count)
    print("top5Matches",t)

    t = timeit(lambda: OPSolution(testSets),number=count)
    print("OPSolution",t)

compareSolutions("SIMPLE TEST SET",20,20,50,count=100)
compareSolutions("MORE SETS:",2000,20,50)
compareSolutions("FEWER INTERSECTIONS:",2000,20,500)
compareSolutions("LARGER SETS:",2000,200,500)
compareSolutions("SETTING FROM COMMENTS:",10000,200,1000)

Results:

-------------------
SIMPLE TEST SET 20 sets of 20 elements in range 0 ... 50
chrisHall 0.0766431910000005
top5Matches 0.07549873900000037
OPSolution 0.05089954700000021
-------------------
MORE SETS: 2000 sets of 20 elements in range 0 ... 50
chrisHall 1.274499733999999
top5Matches 1.2646208220000013
OPSolution 3.796912927000001
-------------------
FEWER INTERSECTIONS: 2000 sets of 20 elements in range 0 ... 500
chrisHall 0.4685694170000012
top5Matches 0.42844527900000173
OPSolution 3.5187148479999983
-------------------
LARGER SETS: 2000 sets of 200 elements in range 0 ... 500
chrisHall 8.538208329
top5Matches 8.51855685
OPSolution 23.192823251999997
-------------------
SETTING FROM COMMENTS: 10000 sets of 200 elements in range 0 ... 1000
chrisHall 190.55364428999997
top5Matches 176.066835327
OPSolution 829.934181724

I've used set intersection to find the common elements, and then sorted the sets based on the number of common elements they contain. Here's some code you might want to try:

from random import randint
from heapq import heappush, heappop

my_sets = []
for i in range(20):
    new_set = set()

    for j in range(20):
        new_set.add(randint(0, 50))

    my_sets.append(new_set)


for i in range(len(my_sets)):
    temp = dict()
    for j in range(len(my_sets)):
        if i == j:
            continue

        common = my_sets[i] & my_sets[j]  # intersection, not difference
        temp[j] = common

    five = sorted(temp.items(), reverse=True, key=lambda s: len(s[1]))[:5]
    five_indexes = [t[0] for t in five]
    print('Closest neighbors to set {} are sets {}'.format(i, five_indexes))

It's simpler and a bit faster (looks like 5-10% on the large case with limits 10000, 200, 1000) to use heapq.nlargest:

from heapq import nlargest

for i in range(len(my_sets)):
    results = nlargest(5,
                       (j for j in range(len(my_sets)) if j != i),
                       key=lambda j: len(my_sets[i] & my_sets[j]))
    print('Closest neighbors to set {} are sets {}'.format(i, results))

This doesn't build a heap with N-1 elements, but with just 5 elements.
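To make that concrete, here is a hand-rolled sketch of the same bounded-heap idea using plain heapq primitives (nlargest is still preferable, and its tie-breaking can differ slightly):

from heapq import heappush, heappushpop

def top5_indexes(i, my_sets):
    heap = []  # min-heap holding at most 5 (overlap, index) pairs
    for j in range(len(my_sets)):
        if j == i:
            continue
        item = (len(my_sets[i] & my_sets[j]), j)
        if len(heap) < 5:
            heappush(heap, item)
        else:
            heappushpop(heap, item)  # push, then evict the smallest
    return [j for _, j in sorted(heap, reverse=True)]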
