简体   繁体   中英

Best way to sort 1M records in Python

I have a service that runs that takes a list of about 1,000,000 dictionaries and does the following

myHashTable = {}
myLists = { 'hits':{}, 'misses':{}, 'total':{} }
sorted = { 'hits':[], 'misses':[], 'total':[] }
for item in myList:
  id = item.pop('id')
  myHashTable[id] = item
  for k, v in item.iteritems():
    myLists[k][id] = v

So, if I had the following list of dictionaries:

[ {'id':'id1', 'hits':200, 'misses':300, 'total':400},
  {'id':'id2', 'hits':300, 'misses':100, 'total':500},
  {'id':'id3', 'hits':100, 'misses':400, 'total':600}
]

I end up with

myHashTable =
{ 
  'id1': {'hits':200, 'misses':300, 'total':400},
  'id2': {'hits':300, 'misses':100, 'total':500},
  'id3': {'hits':100, 'misses':400, 'total':600}
}

and

myLists = 

    {
      'hits': {'id1':200, 'id2':300, 'id3':100},
      'misses': {'id1':300, 'id2':100, 'id3':400},
      'total': {'id1':400, 'id2':500, 'id3':600}
    }

I then need to sort all of the data in each of the myLists dictionaries.

What I doing currently is something like the following:

def doSort(key):
  sorted[key] = sorted(myLists[key].items(), key=operator.itemgetter(1), reverse=True)

which would yield, in the case of misses:
[('id3', 400), ('id1', 300), ('id2', 200)] 

This works great when I have up to 100,000 records or so, but with 1,000,000 it is taking at least 5 - 10 minutes to sort each with a total of 16 (my original list of dictionaries actually has 17 fields including id which is popped)

* EDIT * This service is a ThreadingTCPServer which has a method allowing a client to connect and add new data. The new data may include new records (meaning dictionaries with unique 'id's to what is already in memory) or modified records (meaning the same 'id' with different data for the other key value pairs

So, once this is running I would pass in

 [ {'id':'id1', 'hits':205, 'misses':305, 'total':480}, {'id':'id4', 'hits':30, 'misses':40, 'total':60}, {'id':'id5', 'hits':50, 'misses':90, 'total':20 ] 

I have been using dictionaries to store the data so that I don't end up with duplicates. After the dictionaries are updated with the new/modified data I resort each of them.

* END EDIT *

So, what is the best way for me to sort these? Is there a better method?

你可以从Guido找到这个相关的答案: 使用Python在2MB RAM中排序一百万个32位整数

What you really want is an ordered container, instead of an unordered one. That would implicitly sort the results as they're inserted. The standard data structure for this is a tree.

However, there doesn't seem to be one of these in Python. I can't explain that; this is a core, fundamental data type in any language. Python's dict and set are both unordered containers, which map to the basic data structure of a hash table. It should definitely have an optimized tree data structure; there are many things you can do with them that are impossible with a hash table, and they're quite tricky to implement well, so people generally don't want to be doing it themselves.

(There's also nothing mapping to a linked list, which also should be a core data type. No, a deque is not equivalent.)

I don't have an existing ordered container implementation to point you to (and it should probably be implemented natively, not in Python), but hopefully this will point you in the right direction.

A good tree implementation should support iterating across a range by value ("iterate all values from [2,100] in order"), find next/prev value from any other node in O(1), efficient range extraction ("delete all values in [2,100] and return them in a new tree"), etc. If anyone has a well-optimized data structure like this for Python, I'd love to know about it. (Not all operations fit nicely in Python's data model; for example, to get next/prev value from another value, you need a reference to a node, not the value itself.)

If you have a fixed number of fields, use tuples instead of dictionaries. Place the field you want to sort on in first position, and just use mylist.sort()

Others have provided some excellent advices, try them out.

As a general advice, in situations like that you need to profile your code. Know exactly where most of the time is spent. Bottlenecks hide well, in places you least expect them to be.
If there is a lot of number crunching involved then a JIT compiler like the (now-dead) psyco might also help. When processing takes minutes or hours 2x speed-up really counts.

This seems to be pretty fast.

raw= [ {'id':'id1', 'hits':200, 'misses':300, 'total':400},
    {'id':'id2', 'hits':300, 'misses':100, 'total':500},
    {'id':'id3', 'hits':100, 'misses':400, 'total':600}
]

hits= [ (r['hits'],r['id']) for r in raw ]
hits.sort()

misses = [ (r['misses'],r['id']) for r in raw ]
misses.sort()

total = [ (r['total'],r['id']) for r in raw ]
total.sort()

Yes, it makes three passes through the raw data. I think it's faster than pulling out the data in one pass.

Instead of trying to keep your list ordered, maybe you can get by with a heap queue. It lets you push any item, keeping the 'smallest' one at h[0] , and popping this item (and 'bubbling' the next smallest) is an O(nlogn) operation.

so, just ask yourself:

  • do i need the whole list ordered all the time? : use an ordered structure (like Zope's BTree package, as mentioned by Ealdwulf)

  • or the whole list ordered but only after a day's work of random insertions?: use sort like you're doing, or like S.Lott's answer

  • or just a few 'smallest' items at any moment? : use heapq

sorted(myLists[key], key=mylists[key].get, reverse=True)

应该节省你一些时间,虽然不是很多。

I would look into using a different sorting algorithm. Something like a Merge Sort might work. Break the list up into smaller lists and sort them individually. Then loop.

Pseudo code:

list1 = []  // sorted separately
list2 = []  // sorted separately

// Recombine sorted lists
result = []
while (list1.hasMoreElements || list2.hasMoreElements):
   if (! list1.hasMoreElements):
       result.addAll(list2)
       break
   elseif (! list2.hasMoreElements):
       result.AddAll(list1)
       break

   if (list1.peek < list2.peek):
      result.add(list1.pop)
   else:
      result.add(list2.pop)

Glenn Maynard is correct that a sorted mapping would be appropriate here. This is one for python: http://wiki.zope.org/ZODB/guide/node6.html#SECTION000630000000000000000

I've done some quick profiling of both the original way and SLott's proposal. In neither case does it take 5-10 minutes per field. The actual sorting is not the problem. It looks like most of the time is spent in slinging data around and transforming it. Also, my memory usage is skyrocketing - my python is over 350 megs of ram! are you sure you're not using up all your ram and paging to disk? Even with my crappy 3 year old power saving processor laptop, I am seeing results way less than 5-10 minutes per key sorted for a million items. What I can't explain is the variability in the actual sort() calls. I know python sort is extra good at sorting partially sorted lists, so maybe his list is getting partially sorted in the transform from the raw data to the list to be sorted.

Here's the results for slott's method:

done creating data
done transform.  elapsed: 16.5160000324
sorting one key slott's way takes 1.29699993134

here's the code to get those results:

starttransform = time.time()
hits= [ (r['hits'],r['id']) for r in myList ]
endtransform = time.time()
print "done transform.  elapsed: " + str(endtransform - starttransform)
hits.sort()
endslottsort = time.time()
print "sorting one key slott's way takes " + str(endslottsort - endtransform)

Now the results for the original method, or at least a close version with some instrumentation added:

done creating data
done transform.  elapsed: 8.125
about to get stuff to be sorted 
done getting data. elapsed time: 37.5939998627
about to sort key hits
done  sorting on key <hits> elapsed time: 5.54699993134

Here's the code:

for k, v in myLists.iteritems():
    time1 = time.time()
    print "about to get stuff to be sorted "
    tobesorted = myLists[k].items()
    time2 = time.time()
    print "done getting data. elapsed time: " + str(time2-time1)
    print "about to sort key " + str(k) 
    mysorted[k] = tobesorted.sort( key=itemgetter(1))
    time3 = time.time()
    print "done  sorting on key <" + str(k) + "> elapsed time: " + str(time3-time2)

Honestly, the best way is to not use Python. If performance is a major concern for this, use a faster language.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM