简体   繁体   中英

Fastest way to compare ordered lists and count common elements *including* duplicates

I need to compare two lists of numbers and count how many elements of first list are there in second list. For example,

a =  [2, 3, 3, 4, 4, 5]
b1 = [0, 2, 2, 3, 3, 4, 6, 8]

here I should get result of 4: I should count '2' 1 time (as it happens only once in first list), '3' - 2 times, '4' - 1 time (as it happens only once in second list). I was using the following code:

def scoreIn(list1, list2):
   score=0
   list2c=list(list2)
   for i in list1:
      if i in list2c:
         score+=1
         list2c.remove(i)
   return score

it works correctly, but too slow for my case (I call it 15000 times). I read a hint about 'walking' through sorted lists which was supposed to be faster, so I tried to do like that:

def scoreWalk(list1, list2):
   score=0
   i=0
   j=0
   len1=len(list1) # we assume that list2 is never shorter than list1
   while i<len1:
      if list1[i]==list2[j]:
         score+=1
         i+=1
         j+=1
      elif list1[i]>list2[j]:
         j+=1
      else:
         i+=1
   return score

Unfortunately this code is even slower. Is there any way to make it more efficient? In my case, both lists are sorted, contains only integers, and list1 is never longer than list2.

You can use the intersection feature of collections.Counter to solve the problem in an easy and readable way:

>>> from collections import Counter
>>> intersection = Counter( [2,3,3,4,4,5] ) & Counter( [0, 2, 2, 3, 3, 4, 6, 8] )
>>> intersection
Counter({3: 2, 2: 1, 4: 1})

As @Bakuriu says in the comments, to obtain the number of elements in the intersection (including duplicates), like your scoreIn function, you can then use sum( intersection.values() ) .

However, doing it this way you're not actually taking advantage of the fact that your data is pre-sorted, nor of the fact (mentioned in the comments) that you're doing this over and over again with the same list.

Here is a more elaborate solution more specifically tailored for your problem. It uses a Counter for the static list and directly uses the sorted dynamic list. On my machine it runs in 43% of the run-time of the naïve Counter approach on randomly generated test data.

def common_elements( static_counter, dynamic_sorted_list ):
    last = None # previous element in the dynamic list
    count = 0 # count seen so far for this element in the dynamic list

    total_count = 0 # total common elements seen, eventually the return value

    for x in dynamic_sorted_list:
        # since the list is sorted, if there's more than one element they
        # will be consecutive.
        if x == last:
            # one more of the same as the previous  element

            # all we need to do is increase the count
            count += 1
        else:
            # this is a new element that we haven't seen before.

            # first "flush out" the current count we've been keeping.
            #   - count is the number of times it occurred in the dynamic list
            #   - static_counter[ last ] is the number of times it occurred in
            #       the static list (the Counter class counted this for us)
            # thus the number of occurrences the two have in common is the
            # smaller of these numbers. (Note that unlike a normal dictionary,
            # which would raise KeyError, a Counter will return zero if we try
            # to look up a key that isn't there at all.)
            total_count += min( static_counter[ last ], count )

            # now set count and last to the new element, starting a new run
            count = 1
            last = x

    if count > 0:
        # since we only "flushed" above once we'd iterated _past_ an element,
        # the last unique value hasn't been counted. count it now.
        total_count += min( static_counter[ last ], count )

    return total_count

The idea of this is that you do some of the work up front when you create the Counter object. Once you've done that work, you can use the Counter object to quickly look up counts, just like you look up values in a dictionary: static_counter[ x ] returns the number of times x occurred in the static list.

Since the static list is the same every time, you can do this once and use the resulting quick-lookup structure 15 000 times.

On the other hand, setting up a Counter object for the dynamic list may not pay off performance-wise. There is a little bit of overhead involved in creating a Counter object, and we'd only use each dynamic list Counter one time. If we can avoid constructing the object at all, it makes sense to do so. And as we saw above, you can in fact implement what you need by just iterating through the dynamic list and looking up counts in the other counter.

The scoreWalk function in your post does not handle the case where the biggest item is only in the static list, eg scoreWalk( [1,1,3], [1,1,2] ) . Correcting that, however, it actually performs better than any of the Counter approaches for me, contrary to the results you report. There may be a significant difference in the distribution of your data to my uniformly-distributed test data, but double-check your benchmarking of scoreWalk just to be sure.

Lastly, consider that you may be using the wrong tool for the job. You're not after short, elegant and readable -- you're trying to squeeze every last bit of performance out of a rather simple task. CPython allows you to write modules in C . One of the primary use cases for this is to implement highly optimized code. It may be a good fit for your task.

You can do this with a dict comprehension:

>>> a =  [2, 3, 3, 4, 4, 5]
>>> b1 = [0, 2, 2, 3, 3, 4, 6, 8]
>>> {k: min(b1.count(k), a.count(k)) for k in set(a)}
{2: 1, 3: 2, 4: 1, 5: 0}

This is much faster if set(a) is small. If set(a) is more than 40 items, the Counter based solution is faster.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM