What is the fastest way to merge and sort lists of tuples?

Question

I have a list of lists of tuples. Each tuple has the form (string, int), e.g.

lst = list()
lst.append([('a',5),('c',10),('d',3),('b',1)])
lst.append([('c',14),('f',4),('b',1)])
lst.append([('d',22),('f',2)])

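Think of each int as the number of times the string occurs in a particular block of text.

What I need to do is produce a list of the top N most frequent strings together with their cumulative counts. So in the example above, a appears 5 times, b appears twice, c appears 24 times, and so on. If N=2, I would have to produce either a pair of parallel lists ['d','c'] and [25,24] or a list of tuples [('d',25),('c',24)]. I need to do this as quickly as possible. My machine has plenty of RAM, so memory is not an issue.

I have this implementation: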

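import numpy as np

def getTopN(lst, N):
    sOut = []  # distinct strings, in first-seen order
    cOut = []  # cumulative count for the string at the same index

    for l in lst:
        for s, c in l:
            try:
                # list.index is a linear scan over every string seen so far
                i = sOut.index(s)
                cOut[i] += c
            except ValueError:
                # first occurrence of s
                sOut.append(s)
                cOut.append(c)

    # indices of cOut in descending order of count
    sIndDes = np.argsort(cOut).tolist()[::-1]
    cOutDes = [cOut[i] for i in sIndDes]
    sOutDes = [sOut[i] for i in sIndDes]

    return sOutDes[0:N], cOutDes[0:N]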

There must be a better way, but what would it be?

Answer 1

Use collections.Counter:

import collections
c = collections.Counter()
for x in lst:
    c.update(dict(x))
print(c.most_common(2))

Output:

[('d', 25), ('c', 24)]

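Counter is a dict with some extra functionality, so looking up a value and adding to the current count is very fast. dict(x) just turns the list of tuples into a regular dict mapping strings to numbers, and Counter's update method then adds those counts (rather than just overwriting the values, as a regular dict would).

Alternatively, a more manual approach using defaultdict: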
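To make the difference between the two update methods concrete, here is a minimal sketch. The names batch1, batch2 and dup are made up for illustration and are not part of the question's data. It also shows one thing to watch for: dict(x) keeps only the last value when the same string appears twice within a single sublist, which does not happen in the example data but might in real input.

```python
from collections import Counter

# Hypothetical sample data, not taken from the question.
batch1 = [('a', 5), ('c', 10)]
batch2 = [('c', 14), ('a', 1)]

plain = {}
plain.update(dict(batch1))
plain.update(dict(batch2))      # dict.update overwrites existing values

counted = Counter()
counted.update(dict(batch1))
counted.update(dict(batch2))    # Counter.update adds to existing counts

print(plain)          # {'a': 1, 'c': 14}
print(dict(counted))  # {'a': 6, 'c': 24}

# A string repeated inside one sublist is collapsed by dict() before
# Counter ever sees it, so the earlier count is lost.
dup = [('a', 1), ('a', 2)]
print(dict(dup))      # {'a': 2}
```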

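c = collections.defaultdict(int)
# walk every (string, count) tuple across all sublists and accumulate
for s, n in (t for sublist in lst for t in sublist):
    c[s] += n
# sort the strings by total count, descending, and keep the top 2
print([(k, c[k]) for k in sorted(c, key=c.get, reverse=True)][:2])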
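If N is small compared with the number of distinct strings, the full sort can be swapped for heapq.nlargest from the standard library, which only tracks the N largest items. The sketch below is a variation on the defaultdict approach, not something the answer proposed, and the function name top_n_heap is made up here.

```python
import heapq
from collections import defaultdict

lst = [
    [('a', 5), ('c', 10), ('d', 3), ('b', 1)],
    [('c', 14), ('f', 4), ('b', 1)],
    [('d', 22), ('f', 2)],
]

def top_n_heap(lst, n):
    # Accumulate the total count per string.
    totals = defaultdict(int)
    for sublist in lst:
        for s, count in sublist:
            totals[s] += count
    # Pick the n (string, total) pairs with the largest totals.
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])

print(top_n_heap(lst, 2))  # [('d', 25), ('c', 24)]
```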

As John pointed out in the comments, defaultdict is indeed much faster:

In [2]: %timeit with_counter()
10000 loops, best of 3: 17.3 µs per loop
In [3]: %timeit with_dict()
100000 loops, best of 3: 4.97 µs per loop
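The bodies of with_counter and with_dict are not shown in the answer. The sketch below assumes they simply wrap the two snippets above, and uses the standard timeit module in place of IPython's %timeit. Absolute numbers will depend on the machine and the Python version, so none are quoted here.

```python
import collections
import timeit

lst = [
    [('a', 5), ('c', 10), ('d', 3), ('b', 1)],
    [('c', 14), ('f', 4), ('b', 1)],
    [('d', 22), ('f', 2)],
]

def with_counter():
    # Assumed body: the Counter snippet wrapped in a function.
    c = collections.Counter()
    for x in lst:
        c.update(dict(x))
    return c.most_common(2)

def with_dict():
    # Assumed body: the defaultdict snippet wrapped in a function.
    c = collections.defaultdict(int)
    for sublist in lst:
        for s, n in sublist:
            c[s] += n
    return sorted(c.items(), key=lambda kv: kv[1], reverse=True)[:2]

# Both approaches should agree before their timings are compared.
assert with_counter() == with_dict()

loops = 10000
for fn in (with_counter, with_dict):
    best = min(timeit.repeat(fn, number=loops, repeat=3))
    print(f"{fn.__name__}: {best / loops * 1e6:.2f} µs per loop")
```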

Answer 2

Another option, using numpy:

import numpy as np

# make a flattened numpy version of the list
# (mixed str/int tuples become an array of strings)
lst_np = np.asarray([item for sublist in lst for item in sublist])

# split into the two columns, converting the counts back to integers
chars = lst_np[:, 0]
counts = lst_np[:, 1].astype('int')

# get unique characters, and compute total counts for each
unique_chars, unique_inds = np.unique(chars, return_inverse=True)
unique_counts = np.asarray([np.sum(counts[unique_inds == x])
    for x in range(len(unique_chars))])

This gives you the total count (unique_counts) for every unique character (unique_chars), not just the top N. It should be quite fast, but may use a lot of memory.
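The snippet above stops at the per-character totals. One possible way to finish the job is sketched below: it replaces the per-character Python loop with np.bincount and its weights argument, then takes the top N with np.argsort. This is a variation rather than the answer's own code, and the function name top_n_numpy is made up here. Note that bincount returns floats when weights are given, so the totals are cast back to integers (exact for totals below 2**53).

```python
import numpy as np

lst = [
    [('a', 5), ('c', 10), ('d', 3), ('b', 1)],
    [('c', 14), ('f', 4), ('b', 1)],
    [('d', 22), ('f', 2)],
]

def top_n_numpy(lst, n):
    # Flatten, keeping strings and counts in separate, properly typed arrays.
    flat = [item for sublist in lst for item in sublist]
    chars = np.array([s for s, _ in flat])
    counts = np.array([c for _, c in flat], dtype=np.int64)

    # inverse[i] is the index into unique_chars of chars[i].
    unique_chars, inverse = np.unique(chars, return_inverse=True)

    # Sum the counts per unique string in one vectorized call.
    totals = np.bincount(inverse, weights=counts).astype(np.int64)

    # Indices of the n largest totals, in descending order.
    order = np.argsort(totals)[::-1][:n]
    return [(str(unique_chars[i]), int(totals[i])) for i in order]

print(top_n_numpy(lst, 2))  # [('d', 25), ('c', 24)]
```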


Answer 1: score 6, accepted, 2015-11-25 13:31:33

Answer 2: score 0, 2015-11-25 13:38:02
