I have a list of around 131000 arrays, each of length 300. I am using python I want to check which of the arrays are repeating in this list. I am trying this by comparing each array with others. like :
Import numpy as np
wordEmbeddings = [[0.8,0.4....upto 300 elements]....upto 131000 arrays]
count = 0
for i in range(0,len(wordEmbeddings)):
for j in range(0,len(wordEmbeddings)):
if i != j:
if np.array_equal(wordEmbeddings[i],wordEmbeddings[j]):
count += 1
this is running very slowly, It might take hours to finish, how can I do this efficiently ?
You can use collections.Counter
to count the frequency of each sub list
>>> from collections import Counter
>>> Counter(list(map(tuple, wordEmbeddings)))
We need to cast the sublist to tuples since list is unhashable ie it cannot be used as a key in dict.
This will give you result like this:
>>> Counter({(...4, 5, 6...): 1, (...1, 2, 3...): 1})
The key of Counter
object here is the list and value is the number of times this list occurs. Next you can filter the resulting Counter
object to only yield elements where value is > 1:
>>> items = Counter(list(map(tuple, wordEmbeddings)))
>>> list(filter(lambda x: items[x] > 1,items))
Timeit results:
$ python -m timeit -s "a = [range(300) for _ in range(131000)]" -s "from collections import Counter" "Counter(list(map(tuple, a)))"
10 loops, best of 3: 1.18 sec per loop
You can remove duplicate comparisons by using
for i in range(0,len(wordEmbeddings)):
for j in range(i,len(wordEmbeddings)):
You could look in to pypy for general purpose speed ups.
It might also be worth looking into hashing the arrays somehow.
Here's a question on the speeding up np array comparison . Do the order of the elements matter to you?
You can use set
and tuple
to find duplicated arrays inside another array. Create a new list contains tuples, we use tuples because lists are unhashable type. And then filter new list with using set.
tuple = list(map(tuple, wordEmbeddings))
duplications = set([t for t in tuple if tuple.count(t) > 1])
print(duplications)
也许您可以将初始列表简化为唯一的哈希或非唯一的总和,然后首先遍历哈希-这可能是比较元素的更快方法
I suggest you first sort the list (might also be helpful for further processing) and then compare. The advantage is that you only need to compare every array element to the previous one:
import numpy as np
from functools import cmp_to_key
wordEmbeddings = [[0.8, 0.4, 0.3, 0.2], [0.2,0.3,0.7], [0.8, 0.4, 0.3, 0.2], [ 1.0, 3.0, 4.0, 5.0]]
def smaller (x,y):
for i in range(min(len(x), len(y))):
if x[i] < y[i]:
return 1
elif y[i] < x[i]:
return -1
if len(x) > len(y):
return 1
else:
return -1
wordEmbeddings = sorted(wordEmbeddings, key=cmp_to_key(smaller))
print(wordEmbeddings)
# output: [[1.0, 3.0, 4.0, 5.0], [0.8, 0.4, 0.3, 0.2], [0.8, 0.4, 0.3, 0.2], [0.2, 0.3, 0.7]]
count = 0
for i in range(1, len(wordEmbeddings)):
if (np.array_equal(wordEmbeddings[i], wordEmbeddings[i-1])):
count += 1
print(count)
# output: 1
If N is the length of word embedding and n is the length of the inner array, then your approach was to do O(N*N*n)
comparisons. When reducing the comparisons as in con--'s answer, then you still have O(N*N*n/2)
comparisons.
Sorting will take O(N*log(N)*n)
time and the subsequent step of counting only takes O(N*n)
time which all in all is shorter than O(N*N*n/2)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.