
Number of distinct items between two consecutive uses of an item in realtime

I'm working on a problem that computes a distance: the number of distinct items between two consecutive uses of an item, in real time. The input is read from a large file (~10 GB), but for illustration I'll use a small list.

from collections import OrderedDict
unique_dist = OrderedDict()
input = [1, 4, 4, 2, 4, 1, 5, 2, 6, 2]

for item in input:
    if item in unique_dist:
        indx = unique_dist.keys().index(item) # find the index
        unique_dist.pop(item)                 # pop the item
        size = len(unique_dist)               # find the size of the dictionary
        unique_dist[item] = size - indx       # update the distance value
    else:
        unique_dist[item] = -1                # -1 if it is new
print input
print unique_dist

As you can see, for each item I first check whether it is already present in the dictionary. If it is, I update its distance value; otherwise I insert it at the end with the value -1. The problem is that this seems to become very inefficient as the dictionary grows. Memory isn't a problem, but the pop function seems to be. I say that because if, just for the sake of comparison, I do:

import random

for item in input:
    unique_dist[item] = random.randint(1, 99999)

the program runs really fast. My question is: is there any way I could make my program faster?

EDIT:

It seems that the actual culprit is indx = unique_dist.keys().index(item). When I replaced that with indx = 1, the program was orders of magnitude faster.

According to a simple analysis I did with the cProfile module, the most expensive operations by far are OrderedDict.__iter__() and OrderedDict.keys().
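For anyone who wants to repeat that kind of measurement, here is a minimal sketch of a cProfile run. It is written for Python 3, whereas the code in this thread is Python 2, so the lookup is spelled list(unique_dist).index(item) and the profile will not list exactly the same functions. The function name reuse_distance_py3 and the random test data are made up for illustration.

```python
import cProfile
import io
import pstats
import random
from collections import OrderedDict


def reuse_distance_py3(items):
    """Python 3 port of the loop from the question (hypothetical name)."""
    unique_dist = OrderedDict()
    for item in items:
        if item in unique_dist:
            # Building the key list and searching it are both linear.
            indx = list(unique_dist).index(item)
            unique_dist.pop(item)
            unique_dist[item] = len(unique_dist) - indx
        else:
            unique_dist[item] = -1  # first use of this item
    return unique_dist


# Made-up workload: 5000 items drawn from 500 distinct values.
random.seed(0)
data = [random.randint(1, 500) for _ in range(5000)]

profiler = cProfile.Profile()
profiler.runcall(reuse_distance_py3, data)

# Collect the five most expensive entries by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Sorting by "cumulative" puts the calls that dominate the total run time at the top of the report.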

The following implementation is roughly 7 times as fast as yours (according to the limited testing I did).

  • It avoids the call to unique_dist.keys() by maintaining its own list of items, keys. I'm not entirely sure, but I think this also avoids the call to OrderedDict.__iter__().
  • It avoids the call to len(unique_dist) by incrementing the size variable whenever a new item is added. (I'm not sure how expensive len(OrderedDict) is, so this may not matter much.)

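from collections import OrderedDict

def distance(input):
    dist = []          # distance values, parallel to keys
    key_set = set()    # fast membership test
    keys = []          # items ordered by last use
    size = 0           # number of distinct items seen so far
    for item in input:
        if item in key_set:
            index = keys.index(item)
            del keys[index]
            del dist[index]
            keys.append(item)
            dist.append(size - index - 1)
        else:
            key_set.add(item)
            keys.append(item)
            dist.append(-1)
            size += 1
    return OrderedDict(zip(keys, dist))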

I modified @Rawing's answer to avoid the overhead of the lookups and insertions on the set data structure. Membership is tested against the dist dictionary itself instead.

from random import randint
dist = {}
input = []
for x in xrange(1,10):
    input.append(randint(1,5))
keys = []
size = 0
for item in input:
    if item in dist:
        index = keys.index(item)
        del keys[index]
        keys.append(item)
        dist[item] = size-index-1
    else:
        keys.append(item)
        dist[item] = -1
        size += 1
print input
print dist
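To check that dropping the set does not change the results, and to measure whether it helps on a given machine, the two list-based versions can be wrapped in functions and compared. This is only a sketch under a few assumptions: it is written for Python 3 rather than the Python 2 used in this thread, the names distance_with_set and distance_with_dict are invented for illustration, and the workload is random data rather than the real input file.

```python
import random
import timeit
from collections import OrderedDict


def distance_with_set(items):
    """List-based version that tracks membership in a separate set."""
    dist, keys, key_set = [], [], set()
    size = 0
    for item in items:
        if item in key_set:
            index = keys.index(item)
            del keys[index]
            del dist[index]
            keys.append(item)
            dist.append(size - index - 1)
        else:
            key_set.add(item)
            keys.append(item)
            dist.append(-1)
            size += 1
    return OrderedDict(zip(keys, dist))


def distance_with_dict(items):
    """Variant that uses the result dictionary for the membership test."""
    dist, keys = {}, []
    size = 0
    for item in items:
        if item in dist:
            index = keys.index(item)
            del keys[index]
            keys.append(item)
            dist[item] = size - index - 1
        else:
            keys.append(item)
            dist[item] = -1
            size += 1
    return dist


# Made-up workload: 5000 items drawn from 500 distinct values.
random.seed(1)
data = [random.randint(1, 500) for _ in range(5000)]

# Both variants should assign the same distance to every item.
assert dict(distance_with_set(data)) == distance_with_dict(data)

for fn in (distance_with_set, distance_with_dict):
    seconds = timeit.timeit(lambda: fn(data), number=3)
    print(fn.__name__, round(seconds, 4))
```

Both functions still call keys.index(item), so any difference in the timings comes only from the bookkeeping around that search.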

How about this:

from collections import OrderedDict
unique_dist = OrderedDict()
input = [1, 4, 4, 2, 4, 1, 5, 2, 6, 2]

for item in input:
    if item in unique_dist:
        indx = unique_dist.keys().index(item)
        #unique_dist.pop(item)                # don't pop the item
        size = len(unique_dist)               # now the dictionary is one element too big
        unique_dist[item] = size - indx - 1   # therefore decrement the value here
    else:
        unique_dist[item] = -1                # -1 if it is new
print input
print unique_dist

[1, 4, 4, 2, 4, 1, 5, 2, 6, 2]
OrderedDict([(1, 2), (4, 1), (2, 2), (5, -1), (6, -1)])

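Beware that the entries in unique_dist are now ordered by the first occurrence of each item in the input; yours were ordered by the last occurrence. Because the index is now taken in first-occurrence order, some values can differ as well: 2 maps to 2 above, but to 1 in your output: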

[1, 4, 4, 2, 4, 1, 5, 2, 6, 2]
OrderedDict([(4, 1), (1, 2), (5, -1), (6, -1), (2, 1)])
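Every version in this thread still does a linear search (keys().index(item) or keys.index(item)) for each repeated item. A different technique, not taken from any of the answers here, is to keep a Fenwick tree (binary indexed tree) over input positions. Each distinct item is marked at the position of its most recent use, so the distance is the number of marks strictly between the previous use and the current one, which costs O(log n) per item. The sketch below is for Python 3.7+ and assumes the input length is known up front; a streamed 10 GB file would need chunking or a growable structure, which is not shown. The names reuse_distances, add and prefix_sum are illustrative.

```python
def reuse_distances(items):
    """Map each item to its latest reuse distance, ordered by last use."""
    n = len(items)
    tree = [0] * (n + 1)  # Fenwick tree over positions 1..n

    def add(pos, delta):
        # Add delta to the mark at position pos.
        while pos <= n:
            tree[pos] += delta
            pos += pos & -pos

    def prefix_sum(pos):
        # Number of marks at positions 1..pos.
        total = 0
        while pos > 0:
            total += tree[pos]
            pos -= pos & -pos
        return total

    last_pos = {}  # item -> position of its most recent use
    result = {}    # item -> latest distance, kept in last-use order
    for pos, item in enumerate(items, start=1):
        prev = last_pos.get(item)
        if prev is None:
            dist = -1  # first use of this item
        else:
            # Marks strictly between prev and pos are the distinct
            # items used since the previous use of this item.
            dist = prefix_sum(pos - 1) - prefix_sum(prev)
            add(prev, -1)  # the old position is no longer the latest
        add(pos, 1)
        last_pos[item] = pos
        result.pop(item, None)  # re-insert to move the item to the end
        result[item] = dist
    return result


print(reuse_distances([1, 4, 4, 2, 4, 1, 5, 2, 6, 2]))
# {4: 1, 1: 2, 5: -1, 6: -1, 2: 1}
```

On the sample input this gives the same values and the same ordering as the original OrderedDict version in the question.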


 