
Determine if a Python list is 95% the same?

This question asks how to determine if every element in a list is the same. How would I go about determining if 95% of the elements in a list are the same in a reasonably efficient way? For example:

>>> ninety_five_same([1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1])
True
>>> ninety_five_same([1,1,1,1,1,1,2,1]) # only 87.5% the same
False

This would need to be somewhat efficient because the lists may be very large.

>>> from collections import Counter
>>> lst = [1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
>>> _, freq = Counter(lst).most_common(1)[0]
>>> len(lst)*.95 <= freq
True
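
Wrapped as a function, a minimal sketch of the same idea:

from collections import Counter

def ninety_five_same(lst):
    # The most common element must account for at least 95% of the list.
    _, freq = Counter(lst).most_common(1)[0]
    return freq >= 0.95 * len(lst)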

Actually, there's an easy linear solution for a similar problem, only with a 50% constraint instead of 95%. Check this question; it's just a few lines of code.

It will work for you as well; at the end you just check that the selected element satisfies the 95% threshold rather than 50%. (Although, as Thilo notes, the extra check isn't necessary if currentCount >= n*0.95 already.)

I'll also post the Python code from st0le's answer, to show everybody how difficult it is.

currentCount = 0
currentValue = lst[0]
for val in lst:
    if val == currentValue:
        currentCount += 1
    else:
        currentCount -= 1

    # When the count drops to zero, the current element becomes
    # the new candidate (Boyer-Moore majority vote).
    if currentCount == 0:
        currentValue = val
        currentCount = 1
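
Putting it together, a sketch of the complete check, with the final 95% verification pass mentioned above:

def ninety_five_same(lst):
    # First pass: Boyer-Moore majority vote picks a candidate.
    currentCount = 0
    currentValue = lst[0]
    for val in lst:
        if val == currentValue:
            currentCount += 1
        else:
            currentCount -= 1
        if currentCount == 0:
            currentValue = val
            currentCount = 1
    # Second pass: verify the candidate really reaches the 95% threshold.
    return lst.count(currentValue) >= 0.95 * len(lst)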

If you're looking for an explanation, I think Nabb has the best one.

import collections

def ninety_five_same(lst):
    freq = collections.defaultdict(int)
    for x in lst:
        freq[x] += 1
    freqsort = sorted(freq.values())   # largest frequency ends up last
    return freqsort[-1] >= .95 * sum(freqsort)

Assuming perfect hash table performance and a good sorting algorithm, this runs in O(n + m lg m), where m is the number of distinct items; O(n lg n) in the worst case.

Edit: here's an O(n + m), single-pass version (assuming m << n):

import collections

def ninety_five_same(lst):
    freq = collections.defaultdict(int)
    for x in lst:
        freq[x] += 1
    counts = freq.values()
    return max(counts) >= .95 * sum(counts)

Memory use is O(m). max and sum can be replaced by a single loop, as sketched below.
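
For instance, a sketch of that single loop:

import collections

def ninety_five_same(lst):
    freq = collections.defaultdict(int)
    for x in lst:
        freq[x] += 1
    best = total = 0
    for count in freq.values():
        total += count          # total ends up equal to len(lst)
        if count > best:
            best = count
    return best >= .95 * total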

This is even less efficient than checking if every element is the same.

The algorithm is roughly the same: go through every element in the list and count those that do not match the expected one (with the extra difficulty of knowing which one is the expected one). However, this time you cannot just return false when you meet the first mismatch; you have to continue until you have enough mismatches to make up a 5% error rate.

Come to think of it, figuring out which element is the "right" one is probably not so easy, and involves counting every value up to the point where you can be sure that 5% are misplaced.

Consider a list with 10,000 elements, of which 99% are 42:

  (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ..., 100, 42, 42, 42, 42, ..., 42)

So I think you would have to start out building a frequency table for at least the first 5% of the list.
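
For illustration, a sketch (my own, not from the answers) of an early-exit frequency count along those lines:

def ninety_five_same(lst):
    # Count frequencies, but give up as soon as no element can still reach 95%.
    n = len(lst)
    freq = {}
    best = 0
    for i, x in enumerate(lst):
        freq[x] = freq.get(x, 0) + 1
        if freq[x] > best:
            best = freq[x]
        # Even if every remaining element matched the current best,
        # could the best element still reach the 95% threshold?
        if best + (n - i - 1) < 0.95 * n:
            return False
    return best >= 0.95 * n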

def ninety_five_same(l):
    return max(l.count(i) for i in set(l)) * 20 >= 19 * len(l)

This also eliminates the problem with the accuracy of float division.

Think about your list as a bucket of red and black balls.

If you have one red ball in a bucket of ten balls, and you pick a ball at random and put it back in the bucket, and then repeat that sample-and-replace step a thousand times, how many times out of a thousand do you expect to observe a red ball, on average?

Check out the binomial distribution and confidence intervals. If you have a very long list and want to do things relatively efficiently, sampling is the way to go.
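
A rough sketch of the sampling idea (the sample size and decision rule here are illustrative only; derive them from the binomial confidence interval your error tolerance actually requires):

import random
from collections import Counter

def probably_ninety_five_same(lst, sample_size=1000):
    # Sample with replacement and look at the most common element's share.
    sample = [random.choice(lst) for _ in range(sample_size)]
    _, freq = Counter(sample).most_common(1)[0]
    return freq >= 0.95 * sample_size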

lst = [1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
#lst = [1, 2, 1, 4, 1]
#lst = [1, 2, 1, 4]

length = len(lst)
currentValue = lst[0]
lst.pop(0)
currentCount = 1

for val in lst:
    # Adopt a new candidate whenever the running count hits zero.
    if currentCount == 0:
        currentValue = val

    if val == currentValue:
        currentCount += 1
    else:
        currentCount -= 1

# Map the surplus count onto a percentage: 0 surplus -> 50%, full surplus -> 100%.
percent = currentCount * 50.0 / length + 50
epsilon = 0.1
if percent - 50 > epsilon:
    print("Percent %g%%" % percent)
else:
    print("No majority")

Note: epsilon here is an arbitrary value; choose something depending on the length of the array, etc. Nikita Rybak's solution with currentCount >= n*0.95 won't work, because the value of currentCount differs depending on the order of the elements, but the above does work.

C:\Temp>a.py
[2, 1, 1, 4, 1]
currentCount = 1

C:\Temp>a.py
[1, 2, 1, 4, 1]
currentCount = 2

Sorting as a general solution is probably heavy, but consider the exceptionally well-balanced nature of Python's Timsort, which exploits any existing order in the list. I would suggest sorting the list (or a copy of it with sorted, though that copy will hurt performance). Then scan from the front and the back for the same element, stopping once the scan length exceeds 5%; otherwise the list is 95% similar, with the element found.
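
One way to realize that idea (an assumption of mine about the intended check, using bisect on the sorted list):

import bisect

def ninety_five_same(lst):
    s = sorted(lst)                 # Timsort exploits existing runs
    candidate = s[len(s) // 2]      # a 95% element must cover the middle
    lo = bisect.bisect_left(s, candidate)
    hi = bisect.bisect_right(s, candidate)
    return (hi - lo) >= 0.95 * len(s)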

Taking random elements as candidates and counting them in decreasing order of frequency would probably also not be so bad, stopping once a candidate's count exceeds 95% or the total of the counts goes over 5% (see the sketch after this paragraph).
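
A sketch of that candidate-sampling idea (my own illustration):

import random

def ninety_five_same(lst):
    n = len(lst)
    counted = 0          # how much of the list the tried candidates cover
    tried = set()
    while counted <= 0.05 * n:
        candidate = random.choice(lst)
        if candidate in tried:
            continue
        tried.add(candidate)
        c = lst.count(candidate)
        if c >= 0.95 * n:
            return True
        counted += c
    # The tried candidates already cover more than 5% of the list, so any
    # remaining element accounts for fewer than 95% of the entries.
    return False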
