简体   繁体   中英

Python most efficient way to keep sorted data

What is the most efficient way to keep track of data in a ascending/descending order. Lets say I have a stream of data, assume this is very large. Sample stream:

key,mod,value
5,add,1
2,add,3
4,add,2
2,add,2
2,rem,5

As I am reading the stream, I place it in a dictionary to keep track of the contents. For example, at the end of the mini stream above, I will have a dictionary with {5:1, 4:2} . Where add means the value is increased by the value, and rem means you are removing that much from that key. If the value goes to 0, then delete that key from the dictionary. But I also want to be able to print the data in order (but doesnt have to be all the time.) I do want to keep track of the highest/lowest keys, so that I know when the highest/lowest value has changed. Either the key changed or its value changed.

The way I am doing it right now is fill/delete keys from dictionary accordingly. This should be constant O(1). Keep track of a sorted_keys list, where each stream checks if the new number is in the dictionary, if not, will do bisect.insort_right(sorted_keys, key) . So the sorted_keys is always sorted at all times. Assuming adding 1 value in a sorted list is quick, though it does need to extend the size so this may take O(n) still. And i keep track of prev_highest or prev_lowest , and check this against sorted_keys[0] or sorted_keys[-1] respectively.

I tried using deque with bisect.insort_right, SortedDict from sortedcontainers, linked list, OrderedDict, but it seems that the above seems to work best. Is there another potential implementation that could be more optimized? Or should I just keep track of a certain level in order, say 10 items in order. And update that accordingly. But the problem with this is that if there is a new key, how do i know its one of the new top or not? Seems like having a heapq would help, but I cant get the sorted values until i pop them. And if I need to print the whole thing in order, I just sort the keys of the entire dictionary.

Edit: Adding my tests using bisect and SortedDict below:

import timeit
import bisect
import random
from sortedcontainers import SortedDict

NUM_ITERATION_TEST = 10
TOTAL_NUM_DATA = 1000000
MODS = ['add', 'rem']
QUANTITY = [1, 5, 10, 20, 100, 200, 300, 500, 1000]

DATA = [{'mod': random.choice(MODS),
         'key': random.randint(0, 1000),
         'val': random.choice(QUANTITY)} for x in range(TOTAL_NUM_DATA)]


def method1(DATA):
    d = {}
    sorted_keys = []

    for data in DATA:
        if data['mod'] == 'add':
            key = data['key']
            if key in d.keys():
                d[key] += data['val']
            else:
                d[key] = data['val']
                bisect.insort_right(sorted_keys, key)
        elif data['mod'] == 'rem':
            key = data['key']
            if key in d.keys():
                if d[key] <= data['val']:
                    del d[key]
                    sorted_keys.remove(key)           
                else:
                    d[key] -= data['val']
            else:
                pass # Deleting something not there yet

def method2(DATA):
    d = SortedDict()

    for data in DATA:
        if data['mod'] == 'add':
            key = data['key']
            if key in d.keys():
                d[key] += data['val']
            else:
                d[key] = data['val']
        elif data['mod'] == 'rem':
            key = data['key']
            if key in d.keys():
                if d[key] <= data['val']:
                    del d[key]
                else:
                    d[key] -= data['val']
            else:
                pass  # Deleting something not there yet


if __name__ == "__main__":
    # METHOD 1
    print("Method 1 Execution Time:")
    print(timeit.timeit("test_timeit.method1(test_timeit.DATA)",
                        number=NUM_ITERATION_TEST,
                        setup="import test_timeit"))

    # METHOD 2
    print("Method 2 Execution Time:")
    print(timeit.timeit("test_timeit.method2(test_timeit.DATA)",
                        number=NUM_ITERATION_TEST,
                        setup="import test_timeit"))

The results for the above is:

Method 1 Execution Time:
4.427699800000001
Method 2 Execution Time:
12.7445671

For data that fits in memory, "SortedDict from sortedcontainers" (which you already tried) is generally as good as it gets for keeping such a dict in sorted order. But lookup time is (roughly) O(log N) (see edit at the end - that appears to be false.).

Assuming adding 1 value in a sorted list is quick, though it does need to extend the size so this may take O(n) still.

In a Python list L , inserting an element at index i has to - at a minimum - physically move len(L) - i pointers, which means 8x times that many bytes on a 64-bit box. That's where sortedcontainers gets its huge wins when data gets "large": the worst-case number of pointers it needs to physically move is bounded by a constant independent of len(L) . Until len(L) gets into the thousands, it can be hard to notice the difference. But when len(L) gets into the millions, the difference is huge.

I'd try a compromise: use a sortedcontainers SortedList to keep track of the current keys, and a plain Python dict for the actual dict. Then:

For "key add value": see if the key is in the dict. Very fast. If it is, no need to touch the SortedList. Just mutate the dict. If the key is not in the dict, then it needs to be added to the SortedList and the dict.

For "key rem value": see if the key in the dict. If it's not, I have no idea what you want to do, but you'll figure it out;-) But if it is in the dict, subtract the value. If the result is non-zero you're done. Else (result is 0), remove the key from the dict and the SortedList.

Note: I'm suggesting SortedList instead of SortedSet not for semantic reasons, but because SortedSet requires more memory, to maintain a set in parallel with a sorted list. You have no use for the set.

What you may really want in addition to a dict is a double-ended ("min max") heap . It's impossible to guess from what you've said - it depends on, eg, how often you only want to know "the smallest and/or the largest" compared to often you want to materialize the entire sorted order. But I don't know of a min-max heap implementation for Python that's built for speed - they're messy beasts to code, and rarely used.

EDIT

On second thought, it seems that a sortedcontainer's SortedDict already combines a SortedList with (a subclass of) a plain Python dict. For example, setting a value in a SortedDict is implemented like so:

def __setitem__(self, key, value):
    if key not in self:
        self._list_add(key)
    dict.__setitem__(self, key, value)

So it only touches the SortedList if the key isn't already in the dict. There's not much opportunity for improving on this if you maintain your own <SortedList, dict> pair.

ROLLING YOUR OWN

Here's another to try:

def method3(DATA):
    sorted_keys = SortedList()
    d = {}

    for data in DATA:
        if data['mod'] == 'add':
            key = data['key']
            if key in d:
                d[key] += data['val']
            else:
                d[key] = data['val']
                sorted_keys.add(key)
        elif data['mod'] == 'rem':
            key = data['key']
            if key in d:
                if d[key] <= data['val']:
                    del d[key]
                    sorted_keys.remove(key)
                else:
                    d[key] -= data['val']
            else:
                pass  # Deleting something not there yet

This implements my original suggestion: maintain your own pair of a SortedList with a plain Python dict. It has the same O() behavior as using a SortedDict, but appears significantly faster by a constant factor. The seems partly because the dict operations are all coded in C now (SortedDict is coded in Python), and the rest because we only test for dict membership once per data item. For example, in

if key in d:
    d[key] += data['val']

when d is a SortedDict, key in d tests it once explicitly, but the implementation of d.__setitem__() has to test it again, so that it can add key to its hidden SortedList if the key is unknown. From our higher-level view, we already know the key is in the dict in the body of the if , so can ignore our explicit SortedList entirely there.

You made two mistakes in your method1 :

  • You check if key in d.keys(): instead of if key in d: . There's no point creating a keys view there.

  • You remove from the list with sorted_keys.remove(key) instead of using bisect to find the index and then deleting that index.

Fixing those, storing some methods in local variables for shorter/faster access, and using d.get to possibly find a key and get its value (instead of an in check followed by looking up the value), I get these times (method1/2 are yours, method3 is Tim's, method4 is mine):

Round 1
method1 7.590627200000004
method2 19.851634099999984
method3 6.093115100000006
method4 5.069753999999989

Round 2
method1 7.857367500000009
method2 19.59779759999998
method3 6.057990299999972
method4 5.0046839999999975

Round 3
method1 7.843560700000012
method2 19.8673627
method3 6.079332300000033
method4 5.073929300000032

Tim challenged me to change randint(0, 1000) to randint(0, 40_000) :

method1 607.2835661000001
method2 26.667593300000135
method3 12.84969140000021
method4 16.68231250000008

And randint(0, 400_000) (only the faster solutions):

method3 20.179627500000002
method4 115.39424580000002

My version:

def method4(DATA):
    d = {}
    sorted_keys = []
    insort = bisect.insort_right
    index = bisect.bisect_left
    get = d.get
    
    for data in DATA:
        if data['mod'] == 'add':
            key = data['key']
            val = get(key)
            if val:
                d[key] = val + data['val']
            else:
                d[key] = data['val']
                insort(sorted_keys, key)
        elif data['mod'] == 'rem':
            key = data['key']
            val = get(key)
            if val:
                if val <= data['val']:
                    del d[key]
                    del sorted_keys[index(sorted_keys, key)]
                else:
                    d[key] = val - data['val']
            else:
                pass # Deleting something not there yet

Full benchmark code including correctness checking:

import timeit
import bisect
import random
from sortedcontainers import SortedDict, SortedList

NUM_ITERATION_TEST = 10
TOTAL_NUM_DATA = 1000000
MODS = ['add', 'rem']
QUANTITY = [1, 5, 10, 20, 100, 200, 300, 500, 1000]

DATA = [{'mod': random.choice(MODS),
         'key': random.randint(0, 1000),
         'val': random.choice(QUANTITY)} for x in range(TOTAL_NUM_DATA)]

def method1(DATA, return_=False):
    d = {}
    sorted_keys = []

    for data in DATA:
        if data['mod'] == 'add':
            key = data['key']
            if key in d.keys():
                d[key] += data['val']
            else:
                d[key] = data['val']
                bisect.insort_right(sorted_keys, key)
        elif data['mod'] == 'rem':
            key = data['key']
            if key in d.keys():
                if d[key] <= data['val']:
                    del d[key]
                    sorted_keys.remove(key)           
                else:
                    d[key] -= data['val']
            else:
                pass # Deleting something not there yet
    if return_:
        return d, sorted_keys

def method2(DATA, return_=False):
    d = SortedDict()

    for data in DATA:
        if data['mod'] == 'add':
            key = data['key']
            if key in d.keys():
                d[key] += data['val']
            else:
                d[key] = data['val']
        elif data['mod'] == 'rem':
            key = data['key']
            if key in d.keys():
                if d[key] <= data['val']:
                    del d[key]
                else:
                    d[key] -= data['val']
            else:
                pass  # Deleting something not there yet
    if return_:
        return dict(d), list(d)

def method3(DATA, return_=False):
    sorted_keys = SortedList()
    d = {}

    for data in DATA:
        if data['mod'] == 'add':
            key = data['key']
            if key in d:
                d[key] += data['val']
            else:
                d[key] = data['val']
                sorted_keys.add(key)
        elif data['mod'] == 'rem':
            key = data['key']
            if key in d:
                if d[key] <= data['val']:
                    del d[key]
                    sorted_keys.remove(key)
                else:
                    d[key] -= data['val']
            else:
                pass  # Deleting something not there yet
    if return_:
        return d, list(sorted_keys)

def method4(DATA, return_=False):
    d = {}
    sorted_keys = []
    insort = bisect.insort_right
    index = bisect.bisect_left
    get = d.get
    
    for data in DATA:
        if data['mod'] == 'add':
            key = data['key']
            val = get(key)
            if val:
                d[key] = val + data['val']
            else:
                d[key] = data['val']
                insort(sorted_keys, key)
        elif data['mod'] == 'rem':
            key = data['key']
            val = get(key)
            if val:
                if val <= data['val']:
                    del d[key]
                    del sorted_keys[index(sorted_keys, key)]
                else:
                    d[key] = val - data['val']
            else:
                pass # Deleting something not there yet
    if return_:
        return d, sorted_keys

methods = method1, method2, method3, method4

expect = method1(DATA, True)
for method in methods[1:]:
    print(method(DATA, True) == expect)

if __name__ == "__main__":
    for round in range(1, 4):
        print('Round', round)
        for method in methods:
            t = min(timeit.repeat(lambda: method(DATA),
                                  number=NUM_ITERATION_TEST))
            print(method.__name__, t)
        print()

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM