Fastest way to compare ordered lists and count common elements *including* duplicates
I need to compare two lists of numbers and count how many elements of the first list are also in the second list. For example:
a = [2, 3, 3, 4, 4, 5]
b1 = [0, 2, 2, 3, 3, 4, 6, 8]
Here I should get a result of 4: I should count '2' once (it occurs only once in the first list), '3' twice, and '4' once (it occurs only once in the second list). I was using the following code:
def scoreIn(list1, list2):
    score = 0
    list2c = list(list2)
    for i in list1:
        if i in list2c:
            score += 1
            list2c.remove(i)
    return score
It works correctly, but it is too slow for my case (I call it 15000 times). I read a hint about 'walking' through sorted lists, which was supposed to be faster, so I tried this:
def scoreWalk(list1, list2):
    score = 0
    i = 0
    j = 0
    len1 = len(list1)  # we assume that list2 is never shorter than list1
    while i < len1:
        if list1[i] == list2[j]:
            score += 1
            i += 1
            j += 1
        elif list1[i] > list2[j]:
            j += 1
        else:
            i += 1
    return score
Unfortunately this code is even slower. Is there any way to make it more efficient? In my case, both lists are sorted and contain only integers, and list1 is never longer than list2.
You can use the intersection feature of collections.Counter to solve the problem in an easy and readable way:
>>> from collections import Counter
>>> intersection = Counter( [2,3,3,4,4,5] ) & Counter( [0, 2, 2, 3, 3, 4, 6, 8] )
>>> intersection
Counter({3: 2, 2: 1, 4: 1})
As @Bakuriu says in the comments, to obtain the number of elements in the intersection (including duplicates), like your scoreIn function, you can then use sum( intersection.values() ).
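Wrapped up as a function, the two steps above might look like the minimal sketch below. The name score_counter is made up for illustration and does not appear in the question.

```python
from collections import Counter

def score_counter(list1, list2):
    """Count elements common to both lists, duplicates included."""
    # "&" keeps each key with the smaller of its two counts
    intersection = Counter(list1) & Counter(list2)
    return sum(intersection.values())

a = [2, 3, 3, 4, 4, 5]
b1 = [0, 2, 2, 3, 3, 4, 6, 8]
print(score_counter(a, b1))  # 4
```

Unlike scoreWalk, this does not require either list to be sorted.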
However, doing it this way you're not actually taking advantage of the fact that your data is pre-sorted, nor of the fact (mentioned in the comments) that you're doing this over and over again with the same list.
Here is a more elaborate solution tailored more specifically to your problem. It uses a Counter for the static list and directly uses the sorted dynamic list. On my machine it runs in 43% of the run-time of the naïve Counter approach on randomly generated test data.
def common_elements( static_counter, dynamic_sorted_list ):
    last = None       # previous element in the dynamic list
    count = 0         # count seen so far for this element in the dynamic list
    total_count = 0   # total common elements seen, eventually the return value
    for x in dynamic_sorted_list:
        # since the list is sorted, if there's more than one element they
        # will be consecutive.
        if x == last:
            # one more of the same as the previous element
            # all we need to do is increase the count
            count += 1
        else:
            # this is a new element that we haven't seen before.
            # first "flush out" the current count we've been keeping.
            #   - count is the number of times it occurred in the dynamic list
            #   - static_counter[ last ] is the number of times it occurred in
            #     the static list (the Counter class counted this for us)
            # thus the number of occurrences the two have in common is the
            # smaller of these numbers. (Note that unlike a normal dictionary,
            # which would raise KeyError, a Counter will return zero if we try
            # to look up a key that isn't there at all.)
            total_count += min( static_counter[ last ], count )
            # now set count and last to the new element, starting a new run
            count = 1
            last = x
    if count > 0:
        # since we only "flushed" above once we'd iterated _past_ an element,
        # the last unique value hasn't been counted. count it now.
        total_count += min( static_counter[ last ], count )
    return total_count
The idea of this is that you do some of the work up front when you create the Counter object. Once you've done that work, you can use the Counter object to quickly look up counts, just like you look up values in a dictionary: static_counter[ x ] returns the number of times x occurred in the static list.
Since the static list is the same every time, you can do this once and use the resulting quick-lookup structure 15 000 times.
On the other hand, setting up a Counter object for the dynamic list may not pay off performance-wise. There is a little bit of overhead involved in creating a Counter object, and we'd only use each dynamic list Counter one time. If we can avoid constructing the object at all, it makes sense to do so. And as we saw above, you can in fact implement what you need by just iterating through the dynamic list and looking up counts in the other counter.
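The same idea (build the Counter once, then walk each sorted dynamic list without building a second Counter) can also be sketched with itertools.groupby, which groups runs of equal consecutive items. This is a variation for illustration, not the code timed above; the name common_elements_grouped and the sample dynamic lists are made up, and which of your lists is the static one is an assumption.

```python
from collections import Counter
from itertools import groupby

def common_elements_grouped(static_counter, dynamic_sorted_list):
    total = 0
    # groupby yields one (value, run) pair per run of equal consecutive
    # items; for a sorted list that is one pair per distinct value
    for value, run in groupby(dynamic_sorted_list):
        run_length = sum(1 for _ in run)
        # a Counter returns 0 for a missing key instead of raising KeyError
        total += min(static_counter[value], run_length)
    return total

# built once, reused for every dynamic list
static_counter = Counter([2, 3, 3, 4, 4, 5])

dynamic_lists = [
    [0, 2, 2, 3, 3, 4, 6, 8],
    [1, 3, 5, 5],
]
for dynamic in dynamic_lists:
    print(common_elements_grouped(static_counter, dynamic))  # 4, then 2
```

Whether this is faster or slower than the explicit loop is something to measure on your own data.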
The scoreWalk function in your post does not handle the case where the biggest item is only in the static list, e.g. scoreWalk( [1,1,3], [1,1,2] ): the index j runs past the end of list2 and raises an IndexError. Once that is corrected, however, it actually performs better than any of the Counter approaches for me, contrary to the results you report. There may be a significant difference between the distribution of your data and my uniformly-distributed test data, but double-check your benchmarking of scoreWalk just to be sure.
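One possible correction is to stop as soon as either list is exhausted. The sketch below is not necessarily the exact version that was benchmarked for this answer, and the name score_walk_fixed is made up for illustration.

```python
def score_walk_fixed(list1, list2):
    """Count common elements (with duplicates) of two sorted lists."""
    score = 0
    i = j = 0
    len1, len2 = len(list1), len(list2)
    # stop when either list runs out, so neither index can overrun
    while i < len1 and j < len2:
        x, y = list1[i], list2[j]
        if x == y:
            score += 1
            i += 1
            j += 1
        elif x > y:
            j += 1
        else:
            i += 1
    return score

print(score_walk_fixed([1, 1, 3], [1, 1, 2]))  # 2
print(score_walk_fixed([2, 3, 3, 4, 4, 5], [0, 2, 2, 3, 3, 4, 6, 8]))  # 4
```

With both bounds checked, the assumption that list2 is never shorter than list1 is no longer needed.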
Lastly, consider that you may be using the wrong tool for the job. You're not after short, elegant and readable code; you're trying to squeeze every last bit of performance out of a rather simple task. CPython allows you to write modules in C. One of the primary use cases for this is to implement highly optimized code. It may be a good fit for your task.
You can do this with a dict comprehension:
>>> a = [2, 3, 3, 4, 4, 5]
>>> b1 = [0, 2, 2, 3, 3, 4, 6, 8]
>>> {k: min(b1.count(k), a.count(k)) for k in set(a)}
{2: 1, 3: 2, 4: 1, 5: 0}
This is much faster if set(a) is small. If set(a) has more than 40 items, the Counter-based solution is faster.
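To get the single score the question asks for rather than the per-value dict, the same expression can be summed. This is a small sketch; the name score_count is made up for illustration.

```python
def score_count(list1, list2):
    # for each distinct value in list1, take the smaller of its two counts
    return sum(min(list1.count(k), list2.count(k)) for k in set(list1))

a = [2, 3, 3, 4, 4, 5]
b1 = [0, 2, 2, 3, 3, 4, 6, 8]
print(score_count(a, b1))  # 4
```

Each list.count call scans a whole list, so the cost grows with the number of distinct values in list1, which is consistent with the Counter-based solution winning once set(a) gets larger.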