
How to find the repeating arrays in a list

I have a list of around 131000 arrays, each of length 300. I am using Python, and I want to check which of the arrays repeat in this list. I am trying this by comparing each array with all the others, like this:

import numpy as np
wordEmbeddings = [[0.8,0.4....upto 300 elements]....upto 131000 arrays]
count = 0
for i in range(0,len(wordEmbeddings)):
   for j in range(0,len(wordEmbeddings)):
      if i != j:
         if np.array_equal(wordEmbeddings[i],wordEmbeddings[j]):
            count += 1

This is running very slowly; it might take hours to finish. How can I do this efficiently?

You can use collections.Counter to count the frequency of each sublist:

>>> from collections import Counter
>>> Counter(list(map(tuple, wordEmbeddings)))

We need to cast each sublist to a tuple, since a list is unhashable, i.e. it cannot be used as a key in a dict.

This will give you result like this:

Counter({(...4, 5, 6...): 1, (...1, 2, 3...): 1})

The keys of the Counter object here are the tuples and the values are the number of times each one occurs. Next you can filter the resulting Counter object to yield only the elements whose count is > 1:

>>> items = Counter(list(map(tuple, wordEmbeddings)))
>>> list(filter(lambda x: items[x] > 1, items))
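Putting the two steps together on hypothetical toy data (standing in for the real 131000-element list), the whole approach looks like this:

```python
from collections import Counter

# hypothetical small stand-in for the 131000 x 300 list
wordEmbeddings = [[0.8, 0.4], [0.2, 0.3], [0.8, 0.4], [1.0, 3.0]]

# count each array once, using tuples as hashable keys
counts = Counter(map(tuple, wordEmbeddings))

# keep only the arrays that occur more than once
duplicates = [item for item, freq in counts.items() if freq > 1]
print(duplicates)  # [(0.8, 0.4)]
```

This does a single pass to count plus a single pass to filter, so it is linear in the number of arrays rather than quadratic.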

Timeit results:

$ python -m timeit -s "a = [range(300) for _ in range(131000)]" -s "from collections import Counter" "Counter(list(map(tuple, a)))"
10 loops, best of 3: 1.18 sec per loop

You can remove duplicate comparisons by visiting each pair only once (which also makes the i != j check unnecessary):

for i in range(0, len(wordEmbeddings)):
    for j in range(i + 1, len(wordEmbeddings)):
        if np.array_equal(wordEmbeddings[i], wordEmbeddings[j]):
            count += 1

You could look into PyPy for general-purpose speed-ups.
It might also be worth looking into hashing the arrays somehow.
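One hedged sketch of the hashing idea, assuming the data are NumPy arrays of the same dtype: an array's raw bytes (via tobytes()) are hashable, so identical arrays map to the same dictionary key, grouping duplicates in a single pass. (Toy data below; note that arrays with different dtypes, or -0.0 vs 0.0, would produce different byte keys.)

```python
import numpy as np
from collections import defaultdict

# hypothetical small stand-in for the real 131000 arrays of length 300
wordEmbeddings = [np.array([0.8, 0.4]), np.array([0.2, 0.3]), np.array([0.8, 0.4])]

groups = defaultdict(list)
for idx, arr in enumerate(wordEmbeddings):
    # tobytes() yields a hashable key; equal arrays (same dtype) give equal bytes
    groups[arr.tobytes()].append(idx)

# keep only the groups of indices that contain a repeat
duplicate_index_groups = [ixs for ixs in groups.values() if len(ixs) > 1]
print(duplicate_index_groups)  # [[0, 2]]
```

This also tells you *where* the repeats are, not just how many there are.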

There is an existing question on speeding up NumPy array comparison that may help. Does the order of the elements matter to you?

You can use set and tuple to find the duplicated arrays in the list. Create a new list containing tuples (we use tuples because lists are unhashable), and then filter the new list using a set:

tuples = list(map(tuple, wordEmbeddings))  # renamed so the built-in tuple is not shadowed
duplications = set(t for t in tuples if tuples.count(t) > 1)  # note: count() makes this quadratic
print(duplications)

Perhaps you could reduce the initial list to unique hashes, or to non-unique sums, and iterate over those first - that might be a faster way to compare elements.
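The sum idea above can be sketched as a prefilter (a hedged sketch on hypothetical toy data): bucket the arrays by their sum, then run the expensive element-wise comparison only within each bucket, since arrays with different sums cannot be equal.

```python
from collections import defaultdict
import numpy as np

# hypothetical small stand-in for the real data
wordEmbeddings = [[0.8, 0.4], [0.2, 0.3], [0.8, 0.4], [0.5, 0.7]]

# bucket array indices by their (non-unique) sum
buckets = defaultdict(list)
for idx, arr in enumerate(wordEmbeddings):
    buckets[sum(arr)].append(idx)

# compare pairs only within a bucket; different sums cannot be equal arrays
count = 0
for ixs in buckets.values():
    for a in range(len(ixs)):
        for b in range(a + 1, len(ixs)):
            if np.array_equal(wordEmbeddings[ixs[a]], wordEmbeddings[ixs[b]]):
                count += 1
print(count)  # 1
```

The worst case is still quadratic (if every array had the same sum), but for typical embedding data the buckets stay small.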

I suggest you first sort the list (which might also be helpful for further processing) and then compare. The advantage is that you only need to compare each array element to the previous one:

import numpy as np
from functools import cmp_to_key
wordEmbeddings = [[0.8, 0.4, 0.3, 0.2], [0.2,0.3,0.7], [0.8, 0.4, 0.3, 0.2], [ 1.0, 3.0, 4.0, 5.0]]
def smaller(x, y):
    # comparator for descending lexicographic order
    for i in range(min(len(x), len(y))):
        if x[i] < y[i]:
            return 1
        elif y[i] < x[i]:
            return -1
    # common prefix is equal: order by length, and treat equal lists as equal
    if len(x) > len(y):
        return 1
    elif len(y) > len(x):
        return -1
    return 0
wordEmbeddings = sorted(wordEmbeddings, key=cmp_to_key(smaller))
print(wordEmbeddings)
# output: [[1.0, 3.0, 4.0, 5.0], [0.8, 0.4, 0.3, 0.2], [0.8, 0.4, 0.3, 0.2], [0.2, 0.3, 0.7]]
count = 0
for i in range(1, len(wordEmbeddings)):
    if (np.array_equal(wordEmbeddings[i], wordEmbeddings[i-1])):
        count += 1

print(count)
# output: 1

If N is the number of word embeddings and n is the length of each inner array, then your approach does O(N*N*n) comparisons. When reducing the comparisons as in con--'s answer, you still have about N*N*n/2 comparisons.

Sorting takes O(N*log(N)*n) time, and the subsequent counting step takes only O(N*n) time, which all in all is much shorter than N*N*n/2.
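Plugging in the sizes from the question (N = 131000, n = 300) makes the gap concrete; this is a rough back-of-the-envelope count of element comparisons, not a benchmark:

```python
import math

N = 131_000  # number of arrays (from the question)
n = 300      # length of each array

pairwise = N * N * n // 2               # all-pairs loop, each pair checked once
sort_based = int(N * math.log2(N) * n)  # rough cost of comparison-based sorting

print(f"all-pairs:  {pairwise:.2e} element comparisons")
print(f"sort-based: {sort_based:.2e} element comparisons")
print(f"speedup factor: {pairwise / sort_based:.0f}")
```

The all-pairs count is on the order of 10^12, while the sort-based count is under 10^9, a speedup of several thousand times.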
