Fast method to cycle through multiple lists of tuples to find max of each tuple list

I have tens of thousands of lists of tuples, where each tuple within a list is an (int, float) pair. I want to cycle through all the lists of tuples and, for each list, find the (int, float) pair whose float is the maximum float value in that list. Consider several lists of tuples:

[
[(0, 0.3792), (3, 0.5796)],
[(0, 0.9365), (1, 0.0512), (18, 0.0123)],
[(13, 0.8642)],
[(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)]
]

For each list of tuples, I want to find the pair where the second number is maximized (e.g., for the first list, I want (3, 0.5796); for the 4th list, (0, 0.6249) should be returned). My current approach is to turn the tuples into numpy arrays and then find argmax and max:

from typing import List, Tuple

import numpy as np

def get_max(doc: List[Tuple[int, float]]) -> Tuple[int, float]:
    topic_prob_array = np.array(doc, dtype=np.dtype('int,float'))
    return topic_prob_array['f0'][np.argmax(topic_prob_array['f1'])], np.max(topic_prob_array['f1'])

I was hoping to turn this into a numpy vectorized function (via vec_func = np.vectorize(get_max, otypes=[int, float])) or a numpy ufunc (via vec_func = np.frompyfunc(get_max, nin=1, nout=1)). I'm not sure if I am formatting the input and output correctly. My reasoning is that I send in a single list of tuples and return a single tuple, hence nin=1, nout=1. However, I have not been able to get a vectorized version of this to run.
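For what it's worth, np.frompyfunc can be made to accept one whole list per call if the ragged lists are first wrapped in a 1-D object array; a sketch below (the helper here uses a plain-Python max rather than the structured-array version). Note it still makes one Python-level call per list, so it is not expected to be faster than an ordinary loop:

```python
import numpy as np

def get_max(doc):
    # Pure-Python max on the second element of each pair.
    return max(doc, key=lambda p: p[1])

data = [
    [(0, 0.3792), (3, 0.5796)],
    [(0, 0.9365), (1, 0.0512), (18, 0.0123)],
]

# Wrap the ragged lists in a 1-D object array so the ufunc sees
# one whole list per element instead of iterating into the tuples.
arr = np.empty(len(data), dtype=object)
for i, doc in enumerate(data):
    arr[i] = doc

vec = np.frompyfunc(get_max, 1, 1)  # nin=1, nout=1
result = vec(arr)                   # object array of (int, float) pairs
```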

I also tried a solution without relying on numpy:

def get_max(doc: List[Tuple[int, float]]) -> Tuple[int, float]:
    ids, probabilities = zip(*doc)
    return ids[np.argmax(probabilities)], np.max(probabilities)

Both take about the same amount of time to run. For my roughly 80k lists, both implementations take about 1 minute 10 seconds. I'd really like to get this down if possible.
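One way to see where the time goes is to time the structured-array conversion separately from a plain Python max over one sample list (a rough sketch; absolute numbers vary by machine):

```python
import timeit

import numpy as np

# One sample list from the data above.
doc = [(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)]

# Cost of building the structured array alone, versus a pure-Python max.
t_convert = timeit.timeit(
    lambda: np.array(doc, dtype=np.dtype('int,float')), number=10_000)
t_pure = timeit.timeit(
    lambda: max(doc, key=lambda p: p[1]), number=10_000)
print(f"conversion: {t_convert:.4f}s  pure max: {t_pure:.4f}s")
```

On typical hardware the per-list conversion alone dwarfs the cost of the max, which is why both implementations above time out about the same.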

The optimized non-numpy solution to this is:

from operator import itemgetter

get1 = itemgetter(1)

all_lists = [...]  # Whatever your actual list of list of tuples comes from

all_maxes = [max(lst, key=get1) for lst in all_lists]

numpy isn't likely to gain you much, since the work done is relatively cheap, and if you're only converting to numpy arrays for a single operation, the scope of benefit is smaller.
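Applied to the example lists from the question (substituted here for the placeholder all_lists), the comprehension gives:

```python
from operator import itemgetter

get1 = itemgetter(1)

data = [
    [(0, 0.3792), (3, 0.5796)],
    [(0, 0.9365), (1, 0.0512), (18, 0.0123)],
    [(13, 0.8642)],
    [(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)],
]

# max with key=get1 compares pairs by their second (float) element.
all_maxes = [max(lst, key=get1) for lst in data]
# all_maxes == [(3, 0.5796), (0, 0.9365), (13, 0.8642), (0, 0.6249)]
```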

Do you need to use numpy for this? We can take a functional approach and map the max function, with a custom key, across the whole data set.

from functools import partial
from operator import itemgetter

snd = itemgetter(1)
p = partial(max, key=snd)
list(map(p, data))
>>> [(3, 0.5796), (0, 0.9365), (13, 0.8642), (0, 0.6249)]

Then a quick timing across 80K random tuples from your original dataset.

from random import choice

result = []
for _ in range(80_000):
    result.append(choice(data))

%timeit list(map(p, result))
42.2 ms ± 686 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Like @gold_cy mentioned, I'm not sure if you're looking for a numpy answer. A non-numpy answer could be:

list_tuple = [
    [(0, 0.3792), (3, 0.5796)],
    [(0, 0.9365), (1, 0.0512), (18, 0.0123)],
    [(13, 0.8642)],
    [(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)]
]

[sorted(tup, key=lambda x: x[1], reverse=True).pop(0) for tup in list_tuple]
>>> [(3, 0.5796), (0, 0.9365), (13, 0.8642), (0, 0.6249)]
In [462]: alist
Out[462]: 
[[(0, 0.3792), (3, 0.5796)],
 [(0, 0.9365), (1, 0.0512), (18, 0.0123)],
 [(13, 0.8642)],
 [(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)]]
In [463]: blist = alist*10000    # bigger test list

Playing around with alternatives, I found this "brute force" function is fastest (though not by much):

def get_max3(doc):
    m = doc[0]
    for i in doc[1:]:
        if i[1]>m[1]: m=i
    return m

For the small list, the list comprehension is slightly faster; for the big list, the map version has the edge, but not by much.

In [465]: [get_max3(i) for i in alist]
Out[465]: [(3, 0.5796), (0, 0.9365), (13, 0.8642), (0, 0.6249)]

In [466]: timeit [get_max3(i) for i in alist]
1.9 µs ± 51.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [467]: timeit list(map(get_max3,blist))
15 ms ± 7.77 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Versions using numpy are all much slower; it takes time to convert the list of tuples into a numpy array (even a structured array).
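If numpy is still attractive, the conversion cost can at least be paid once for the whole dataset instead of once per list: flatten all pairs into two flat arrays plus a segment index, then pick each segment's max with one sort. A sketch (the variable names are mine, not from the answers above); whether this beats the plain-Python loop depends on the shape of the data:

```python
import numpy as np

data = [
    [(0, 0.3792), (3, 0.5796)],
    [(0, 0.9365), (1, 0.0512), (18, 0.0123)],
    [(13, 0.8642)],
    [(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)],
]

# Flatten everything once, instead of one np.array call per list.
ids = np.array([p[0] for doc in data for p in doc])
probs = np.array([p[1] for doc in data for p in doc])
lengths = np.array([len(doc) for doc in data])
seg = np.repeat(np.arange(len(data)), lengths)  # which list each row came from

# Sort rows by (list index, prob); the last row of each list's run
# in the sorted order is that list's maximum.
order = np.lexsort((probs, seg))
best = order[np.cumsum(lengths) - 1]
result = list(zip(ids[best].tolist(), probs[best].tolist()))
```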
