简体   繁体   中英

fastest method of getting k smallest numbers in unsorted list of size N in python?

What is the fastest method to get the k smallest numbers in an unsorted list of size N using python?
Is it faster to sort the big list of numbers, and then get the k smallest numbers,
or to get the k smallest numbers by finding the minimum in the list k times, making sure u remove the found minimum from the search before the next search?

You could use a heap queue; it can give you the K largest or smallest numbers out of a list of size N in O(NlogK) time.

The Python standard library includes the heapq module , complete with a heapq.nsmallest() function ready implemented:

import heapq

k_smallest = heapq.nsmallest(k, input_list)

Internally, this creates a heap of size K with the first K elements of the input list, then iterating over the remaining NK elements, pushing each to the heap, then popping off the largest one. Such a push and pop takes log K time, making the overall operation O(NlogK).

The function also optimises the following edge cases:

  • If K is 1, the min() function is used instead, giving you a O(N) result.
  • If K >= N, the function uses sorting instead, since O(NlogN) would beat O(NlogK) in that case.

A better option is to use the introselect algorithm , which offers an O(n) option. The only implementation I am aware of is using the numpy.partition() function :

import numpy

# assuming you have a python list, you need to convert to a numpy array first
array = numpy.array(input_list)
# partition, slice back to the k smallest elements, convert back to a Python list
k_smallest = numpy.partition(array, k)[:k].tolist()

Apart from requiring installation of numpy , this also takes N memory (versus K for heapq ), as a copy of the list is created for the partition.

If you only wanted indices, you can use, for either variant:

heapq.nsmallest(k, range(len(input_list)), key=input_list.__getitem__)  # O(NlogK)
numpy.argpartition(numpy.array(input_list), k)[:k].tolist()  # O(N)

EDIT: this assumes that the list is immutable. If the list is an array and can be modified there are linear methods available.

You can get the complexity down to O(n * log k) by using a heap of size k + 1 .

  1. Initially get the first k elements into a min-heap.
  2. For every subsequent element, add the element as a leaf and heapify.
  3. Replace the last element with the next element.

Heapify can be done in logarithmic time and hence the time complexity is as above.

You can do it in O(kn) with a selection algorithm . Once kn >= n log n , switch to sorting. That said, the constant on the selection algorithm tends to be a lot higher than the one on quicksort, so you really need to compare i (kn) and j (n log n) . In practice, it's usually more desirable to just sort unless you're dealing with large n or very small k .

Edit: see comments. It's actually a lot better.

You might want to take a look at heapq :

In [109]: L = [random.randint(1,1000) for _ in range(100)]

In [110]: heapq.nsmallest(10, L)
Out[110]: [1, 17, 17, 19, 24, 37, 37, 45, 63, 73]

If the list of the kth smallest numbers doesn't need to be sorted, this can be done in O(n) time with a selection algorithm like introselect . The standard library doesn't come with one, but NumPy has numpy.partition for the job:

partitioned = numpy.partition(l, k)
# The subarray partitioned[:k] now contains the k smallest elements.

Using nsmallest numbers in heapq is less code but if you are looking to implement it yourself this is a simple way to do it. This solution requires looping through the data once only but since heappush and heappop run on O(log n) this algorithm would perform best on smaller numbers of k.

import heapq

def getsmallest(arr, k):
    m = [-x for x in l[:k]]
    heapq.heapify(m)
    for num in arr[5:]:
        print num, m
        heapq.heappush(m, max(-num, heapq.heappop(m)))
    return m

if __name__ == '__main__':
    l = [1,2,3,52,2,3,1]
    print getsmallest(l, 5)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM