
Fastest way to compare ordered lists and count common elements *including* duplicates

I need to compare two lists of numbers and count how many elements of the first list are also present in the second list. For example:

a =  [2, 3, 3, 4, 4, 5]
b1 = [0, 2, 2, 3, 3, 4, 6, 8]

Here I should get a result of 4: I should count '2' once (as it occurs only once in the first list), '3' twice, and '4' once (as it occurs only once in the second list). I was using the following code:

def scoreIn(list1, list2):
   score=0
   list2c=list(list2)
   for i in list1:
      if i in list2c:
         score+=1
         list2c.remove(i)
   return score

It works correctly, but it is too slow for my case (I call it 15000 times). I read a hint about 'walking' through sorted lists, which was supposed to be faster, so I tried this:

def scoreWalk(list1, list2):
   score=0
   i=0
   j=0
   len1=len(list1) # we assume that list2 is never shorter than list1
   while i<len1:
      if list1[i]==list2[j]:
         score+=1
         i+=1
         j+=1
      elif list1[i]>list2[j]:
         j+=1
      else:
         i+=1
   return score

Unfortunately, this code is even slower. Is there any way to make it more efficient? In my case, both lists are sorted, contain only integers, and list1 is never longer than list2.

You can use the intersection feature of collections.Counter to solve the problem in an easy and readable way:

>>> from collections import Counter
>>> intersection = Counter( [2,3,3,4,4,5] ) & Counter( [0, 2, 2, 3, 3, 4, 6, 8] )
>>> intersection
Counter({3: 2, 2: 1, 4: 1})

As @Bakuriu says in the comments, to obtain the number of elements in the intersection (including duplicates), as your scoreIn function does, you can then use sum( intersection.values() ).
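Putting the two steps together, a minimal sketch of a drop-in replacement for scoreIn might look like this. The name score_counter is an assumption made for illustration; it does not appear in the original post.

```python
from collections import Counter

def score_counter(list1, list2):
    """Count common elements of two lists, including duplicates."""
    # Counter & Counter keeps, for each key, the minimum of the two counts.
    intersection = Counter(list1) & Counter(list2)
    return sum(intersection.values())

a = [2, 3, 3, 4, 4, 5]
b1 = [0, 2, 2, 3, 3, 4, 6, 8]
print(score_counter(a, b1))  # prints 4
```

Note that this version does not require the inputs to be sorted.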

However, doing it this way you're not actually taking advantage of the fact that your data is pre-sorted, nor of the fact (mentioned in the comments) that you're doing this over and over again with the same list.

Here is a more elaborate solution, tailored more specifically to your problem. It uses a Counter for the static list and iterates directly over the sorted dynamic list. On my machine, on randomly generated test data, it takes 43% of the run-time of the naïve Counter approach.

def common_elements( static_counter, dynamic_sorted_list ):
    last = None # previous element in the dynamic list
    count = 0 # count seen so far for this element in the dynamic list

    total_count = 0 # total common elements seen, eventually the return value

    for x in dynamic_sorted_list:
        # since the list is sorted, if there's more than one element they
        # will be consecutive.
        if x == last:
            # one more of the same as the previous  element

            # all we need to do is increase the count
            count += 1
        else:
            # this is a new element that we haven't seen before.

            # first "flush out" the current count we've been keeping.
            #   - count is the number of times it occurred in the dynamic list
            #   - static_counter[ last ] is the number of times it occurred in
            #       the static list (the Counter class counted this for us)
            # thus the number of occurrences the two have in common is the
            # smaller of these numbers. (Note that unlike a normal dictionary,
            # which would raise KeyError, a Counter will return zero if we try
            # to look up a key that isn't there at all.)
            total_count += min( static_counter[ last ], count )

            # now set count and last to the new element, starting a new run
            count = 1
            last = x

    if count > 0:
        # since we only "flushed" above once we'd iterated _past_ an element,
        # the last unique value hasn't been counted. count it now.
        total_count += min( static_counter[ last ], count )

    return total_count

The idea is that you do some of the work up front, when you create the Counter object. Once you've done that work, you can use the Counter object to quickly look up counts, just as you look up values in a dictionary: static_counter[ x ] returns the number of times x occurred in the static list.

Since the static list is the same every time, you can do this once and use the resulting quick-lookup structure 15 000 times.
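The build-once, reuse-many-times pattern could be sketched as follows. This is not the function above but a hypothetical, more compact variant (common_elements_grouped) that uses itertools.groupby to find the runs of equal values; the sample lists are made up, and its speed relative to the hand-written loop has not been measured here.

```python
from collections import Counter
from itertools import groupby

def common_elements_grouped(static_counter, dynamic_sorted_list):
    """Hypothetical variant of common_elements built on itertools.groupby."""
    total = 0
    # groupby yields one (value, run) pair per run of equal consecutive items,
    # which for a sorted list means one pair per distinct value.
    for value, run in groupby(dynamic_sorted_list):
        run_length = sum(1 for _ in run)
        # A Counter returns 0 for keys it has never seen.
        total += min(static_counter[value], run_length)
    return total

# Build the lookup structure once ...
static_counter = Counter([2, 3, 3, 4, 4, 5])

# ... then reuse it for every dynamic list (made-up sample data).
dynamic_lists = [
    [0, 2, 2, 3, 3, 4, 6, 8],
    [3, 3, 3, 5, 5],
    [],
]
scores = [common_elements_grouped(static_counter, d) for d in dynamic_lists]
print(scores)  # prints [4, 3, 0]
```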

On the other hand, setting up a Counter object for the dynamic list may not pay off performance-wise. There is some overhead involved in creating a Counter object, and we'd only use each dynamic list's Counter once. If we can avoid constructing the object at all, it makes sense to do so. And as we saw above, you can in fact implement what you need by just iterating through the dynamic list and looking up counts in the static Counter.

The scoreWalk function in your post does not handle the case where the biggest item is only in the static list, e.g. scoreWalk( [1,1,3], [1,1,2] ). Once that is corrected, however, it actually performs better than any of the Counter approaches for me, contrary to the results you report. There may be a significant difference between the distribution of your data and my uniformly distributed test data, but double-check your benchmarking of scoreWalk just to be sure.
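One possible correction, not necessarily the one used for the timings mentioned above, is to stop as soon as either list is exhausted. The name score_walk is an assumption chosen to keep it apart from the original scoreWalk.

```python
def score_walk(list1, list2):
    """Count common elements of two sorted lists, including duplicates."""
    score = 0
    i = j = 0
    len1, len2 = len(list1), len(list2)
    # Stop as soon as either list is exhausted; nothing further can match.
    while i < len1 and j < len2:
        x, y = list1[i], list2[j]
        if x == y:
            score += 1
            i += 1
            j += 1
        elif x > y:
            j += 1
        else:
            i += 1
    return score

print(score_walk([1, 1, 3], [1, 1, 2]))  # prints 2
print(score_walk([2, 3, 3, 4, 4, 5], [0, 2, 2, 3, 3, 4, 6, 8]))  # prints 4
```

With the extra bound on j, the function no longer depends on list2 being at least as long as list1.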

Lastly, consider that you may be using the wrong tool for the job. You're not after short, elegant and readable code -- you're trying to squeeze every last bit of performance out of a rather simple task. CPython allows you to write modules in C. One of the primary use cases for this is to implement highly optimized code. It may be a good fit for your task.

You can do this with a dict comprehension:

>>> a =  [2, 3, 3, 4, 4, 5]
>>> b1 = [0, 2, 2, 3, 3, 4, 6, 8]
>>> {k: min(b1.count(k), a.count(k)) for k in set(a)}
{2: 1, 3: 2, 4: 1, 5: 0}

This is much faster if set(a) is small. If set(a) has more than 40 items, the Counter-based solution is faster.
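To reduce the comprehension to the single score the question asks for, the per-value minimums can be summed directly. This is a small sketch; the name score_count_method is an assumption made for illustration.

```python
from collections import Counter

def score_count_method(list1, list2):
    """Sum the per-value minimum counts, using list.count for each distinct value."""
    # list.count scans the whole list on every call, so the total cost grows
    # with the number of distinct values in list1.
    return sum(min(list1.count(k), list2.count(k)) for k in set(list1))

a = [2, 3, 3, 4, 4, 5]
b1 = [0, 2, 2, 3, 3, 4, 6, 8]
print(score_count_method(a, b1))  # prints 4

# Cross-check against the Counter intersection approach.
assert score_count_method(a, b1) == sum((Counter(a) & Counter(b1)).values())
```

The repeated scans by list.count are a plausible reason why this approach only wins while set(a) stays small.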
