Fastest way to sort a corpus dictionary into an OrderedDict - Python

Question

Given a corpus/text like this:

Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999 , and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period .
Although , as you will have seen , the dreaded &apos; millennium bug &apos; failed to materialise , still the people in a number of countries suffered a series of natural disasters that truly were dreadful .
You have requested a debate on this subject in the course of the next few days , during this part @-@ session .
In the meantime , I should like to observe a minute &apos; s silence , as a number of Members have requested , on behalf of all the victims concerned , particularly those of the terrible storms , in the various countries of the European Union .

I can simply do this to get a dictionary of word frequencies:

>>> from collections import Counter
>>> word_freq = Counter()
>>> for line in text.split('\n'):
...     for word in line.split():
...         word_freq[word] += 1
...

But if the aim is an ordered dictionary from highest to lowest frequency, I have to do this:

>>> from collections import OrderedDict
>>> sorted_word_freq = OrderedDict()
>>> for word, freq in word_freq.most_common():
...     sorted_word_freq[word] = freq
...

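Imagine that I have 1 billion keys in the Counter object: iterating through most_common() would have the complexity of going through the corpus (non-unique instances) once and the vocabulary (unique keys) once.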

Note: Counter.most_common() calls an ad-hoc sorted(), see https://hg.python.org/cpython/file/e38470b49d3c/Lib/collections.py#l472
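
To make that cost concrete: with no argument, most_common() amounts to one sort over the unique (word, count) pairs, so it scales with the vocabulary size and does not touch the corpus again. A minimal sketch that checks the equivalence on current CPython; the toy sentence is made up for illustration:

```python
from collections import Counter
from operator import itemgetter

# Toy data standing in for the real corpus (illustrative only).
word_freq = Counter("the cat and the dog and the bird".split())

# One full sort of the (word, count) pairs by count, highest first.
by_hand = sorted(word_freq.items(), key=itemgetter(1), reverse=True)

# Compare with what most_common() returns when called without an argument.
same = by_hand == word_freq.most_common()
```

Both lists also agree on tied counts, because sorted() is stable even with reverse=True.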

Given this, I have seen the following code that uses numpy.argsort():

>>> import numpy as np
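>>> words = list(word_freq.keys())    # list() is needed on Python 3, where keys() is a view
>>> freqs = list(word_freq.values())  # same for values(), so that freqs[idx] works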
>>> sorted_word_index = np.argsort(freqs) # lowest to highest
>>> sorted_word_freq_with_numpy = OrderedDict()
>>> for idx in reversed(sorted_word_index):
...     sorted_word_freq_with_numpy[words[idx]] = freqs[idx]
...

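Which is faster?

Is there any other faster way to get such an OrderedDict from a Counter?

Other than OrderedDict, are there other Python objects that provide the same sorted key-value pairs?

Assume that memory is not an issue. Given 120 GB of RAM, there shouldn't be much of a problem keeping 1 billion key-value pairs, right? Assume an average of 20 characters per key for the 1 billion keys and a single integer for each value.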

Answer 1

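The Series object in Pandas is an array of key-value pairs (which can have non-unique keys) and may be of interest here. It has a sort method that sorts by value and is implemented in Cython. Here is an example that sorts an array of length one million: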

In [39]:
import pandas as pd
import numpy as np

arr = np.arange(1e6)
np.random.shuffle(arr)
s = pd.Series(arr, index=np.arange(1e6))
%timeit s.sort()
%timeit sorted(arr)

1 loops, best of 3: 85.8 ms per loop
1 loops, best of 3: 1.15 s per loop

Given an ordinary Python dict, you can construct a Series from it by calling:

my_series = pd.Series(my_dict)

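and then sort by value with

my_series.sort()  # in-place sort in the pandas of the time; later releases removed it in favour of my_series.sort_values()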
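
On current pandas the same idea can be written with Series.sort_values, which returns a new Series instead of sorting in place. A minimal sketch, assuming a recent pandas release; the toy sentence and the variable names are made up for illustration:

```python
from collections import Counter
import pandas as pd

# Toy data standing in for the real word counts (illustrative only).
word_freq = Counter("the cat and the dog and the bird".split())

# Keys become the index, counts become the values.
counts = pd.Series(dict(word_freq))

# Sort by value, highest count first; the original Series is left untouched.
sorted_counts = counts.sort_values(ascending=False)

top_word = sorted_counts.index[0]
top_count = int(sorted_counts.iloc[0])
```

The result is still a Series; if an OrderedDict is needed afterwards, OrderedDict(sorted_counts.items()) should do, at the cost of one more pass over the pairs.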

Answer 2

The first step in improving speed is to fill the counter in an optimal way.

For example, with your txt (802 characters):

mycounter=Counter(txt.split())

produces the same thing as your word_freq, but in 1/3 of the time.

Or, if you must read the text line by line from a file, use:

word_freq=Counter()
for line in txt.splitlines():
    word_freq.update(line.split())
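
A small check that the two ways of filling the counter agree. This is only a sketch on a made-up two-line string, not a benchmark:

```python
from collections import Counter

# Toy two-line text (illustrative only).
txt = "the cat and the dog\nand the bird\n"

# One shot: split() with no argument splits on any whitespace, newlines included.
one_shot = Counter(txt.split())

# Line by line: update() adds the counts of each line's words to the running total.
line_by_line = Counter()
for line in txt.splitlines():
    line_by_line.update(line.split())
```

Both counters hold the same counts, so the choice between them is about how the text arrives (all at once or as a stream of lines), not about the result.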

Similarly, the ordered dictionary can be created without a loop:

import operator
mydict = OrderedDict(sorted(mycounter.items(), key=operator.itemgetter(1), reverse=True))

Here I am calling sorted in the same way that most_common does (as per your link), and I am passing the sorted list of items directly to the OrderedDict constructor.
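
On Python 3.7 and later, a plain dict also preserves insertion order, so the same sorted list can be passed to dict when nothing OrderedDict-specific is needed. A minimal sketch under that assumption, with a made-up toy sentence:

```python
from collections import Counter, OrderedDict

# Toy data standing in for the real counter (illustrative only).
mycounter = Counter("the cat and the dog and the bird".split())
ranked = mycounter.most_common()

ordered = OrderedDict(ranked)
plain = dict(ranked)  # keeps insertion order on Python 3.7+

# Both iterate from the most frequent word to the least frequent.
same_order = list(plain) == list(ordered)
```

One difference to keep in mind: comparing two OrderedDicts is order-sensitive, while comparing two plain dicts is not.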

When I look at mycounter in ipython, I get the values in sorted order:

In [160]: mycounter
Out[160]: Counter({'the': 13, ',': 10, 'of': 9, 'a': 7, '.': 4, 'in': 4, 'to': 3, 'have': 3, 'session': 3, '&apos;': 3, 'on': 3, 'you': 3, 'I': 3, 'that': 2, 'requested': 2, 'like': 2, 'European': 2, 'this': 2, 'countries': 2, 'as': 2, 'number': 2, 's': 1, 'various': 1, 'wish': 1, 'will': 1, 'Parliament': 1, 'meantime': 1, 'Resumption': 1, 'natural': 1, 'days': 1, 'debate': 1, 'You': 1, 'Members': 1, 'next': 1, '@-@': 1, 'hope': 1, 'enjoyed': 1, 'December': 1, 'victims': 1, 'particularly': 1, 'millennium': 1, .... 'behalf': 1, 'were': 1, 'failed': 1})

That is because its __repr__ method calls most_common. Again, this is from your link:

items = ', '.join(map('%r: %r'.__mod__, self.most_common()))

On further testing I see that calling sorted directly does not save time:

In [166]: timeit mycounter.most_common()
10000 loops, best of 3: 31.1 µs per loop

In [167]: timeit sorted(mycounter.items(),key=operator.itemgetter(1),reverse=True)
10000 loops, best of 3: 30.5 µs per loop

In [168]: timeit OrderedDict(mycounter.most_common())
1000 loops, best of 3: 225 µs per loop

In this case, loading the dictionary directly does not save time either. Your iteration does just as well:

In [174]: %%timeit 
   .....: sorteddict=OrderedDict()
   .....: for word,freq in word_freq.most_common():
   .....:     sorteddict[word]=freq
   .....: 
1000 loops, best of 3: 224 µs per loop

For this example, using np.argsort does not help (time-wise). Just calling argsort is slower than most_common.

In [178]: timeit np.argsort(list(mycounter.values()))
10000 loops, best of 3: 34.2 µs per loop

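Most of that time goes into converting the list to an array, x=np.array(list(mycounter.values())). np.argsort(x) on its own is much faster. That is true of a lot of numpy functionality: when operating on arrays numpy is fast, but there is a lot of overhead in converting lists to arrays.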
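
For completeness, here is how the argsort route might look once the conversion cost is paid a single time up front. This is a sketch with made-up toy data, not a claim that it beats most_common; the variable names are illustrative:

```python
from collections import Counter, OrderedDict
import numpy as np

# Toy data standing in for the real counter (illustrative only).
word_freq = Counter("the cat and the dog and the bird".split())

# Convert once; this is where the list-to-array overhead sits.
words = np.array(list(word_freq.keys()))
freqs = np.fromiter(word_freq.values(), dtype=np.int64, count=len(word_freq))

# A stable argsort on the negated counts gives highest-to-lowest order
# and keeps first-seen order among words with equal counts.
order = np.argsort(-freqs, kind="stable")

# Index both arrays with the same permutation and go back to Python objects.
sorted_word_freq = OrderedDict(zip(words[order].tolist(), freqs[order].tolist()))
```

Whether this pays off depends on the vocabulary size: building the OrderedDict is still a Python-level loop over every key, and only the sort itself runs inside numpy.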

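I can create the OrderedDict via numpy in one line with:

OrderedDict(np.sort(np.array(list(mycounter.items()), dtype='S12,i'), order='f1')[::-1])

or in pieces:

# 'S12,i' is a 12-byte string field plus an int field ('a12' is an older spelling of 'S12')
lla = np.array(list(mycounter.items()),dtype='S12,i')
lla.sort(order='f1')    # sort in place by the second field, the count
OrderedDict(lla[::-1])  # reverse to get highest-to-lowest order

I am making a structured array from the items(), sorting it by the second field, and then making the dictionary. There are no time savings, though. See https://stackoverflow.com/a/31837513/901925 for another recent example of using order to sort a structured array.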

Solution 1
Score: 3, accepted, 2015-08-08 17:42:53

Solution 2
Score: 2, 2015-08-08 16:51:21
