Parallelizing or otherwise speeding up calculation within generator and pandas dataframe

I am doing a calculation on permutations of things from a generator created by itertools. I have a piece of code in this form (this is a dummy example):

import itertools
import pandas as pd

combos = itertools.permutations('abcdefghi',2)
results = []
i=0

for combo in combos:
    i+=1 #this line is actually other stuff that's expensive
    results.append([combo[0]+'-'+combo[1],i])

rdf = pd.DataFrame(results, columns=['combo','value'])

Except in the real code:

  • there are several hundred thousand permutations
  • instead of i+=1 I am opening files and getting results of clf.predict, where clf is a classifier trained in scikit-learn
  • in place of i I'm storing a value from that prediction

I think the combo[0]+'-'+combo[1] part is trivial, though.

This takes too long. What should I do to make it faster? Such as:

1) writing better code (maybe I should initialize results with the proper length instead of using append, but how much will that help? And what's the best way to do that when I don't know the length before iterating through combos?)

2) initializing a pandas dataframe instead of a list and using apply?

3) using Cython in pandas? I'm a total newbie to this.

4) parallelizing? I think I probably need to do this, but again, I'm a total newbie, and I don't know whether it's better to do it within a list or a pandas dataframe. I understand I would need to iterate over the generator and initialize some kind of container before parallelizing.

Which combination of these options would be best, and how can I put it together?

Repeated append calls, whether in pandas or in a Python for loop, are slow. This code avoids them by building the rows with a list comprehension.

import itertools
import pandas as pd

combos = itertools.permutations('abcdefghi',2)
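# Pair each joined permutation with its 1-based position, with no explicit loop or append.
combo_values = [('-'.join(combo), i) for i, combo in enumerate(combos, 1)]

# Build the DataFrame in one step from the list of (combo, value) tuples.
rdf = pd.DataFrame(combo_values, columns=['combos', 'value'])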

You can do this for each file and dataframe that you have, then use pd.concat to quickly combine the results afterwards. You can also add the enumeration of the permutations afterward if you want.
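As a rough sketch of how the per-chunk DataFrame and pd.concat idea might be combined with the parallelization the question asks about, the example below splits the permutations into chunks, processes the chunks in a thread pool, and concatenates the results. The names expensive_value, chunked, process_chunk and build_results are hypothetical, as are the chunk size and worker count; expensive_value is only a placeholder for the real file reading and clf.predict call. Threads tend to help only when the work is I/O-bound or releases the GIL, so whether this is faster for the real workload is something to measure.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def expensive_value(combo):
    """Hypothetical stand-in for the real work (opening files, clf.predict)."""
    return ord(combo[0]) + ord(combo[1])


def chunked(iterable, size):
    """Yield successive lists of up to `size` items from an iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk


def process_chunk(chunk):
    """Build one small DataFrame for a chunk of permutations."""
    return pd.DataFrame({
        'combo': ['-'.join(combo) for combo in chunk],
        'value': [expensive_value(combo) for combo in chunk],
    })


def build_results(items, chunk_size=16, max_workers=4):
    """Process permutations chunk by chunk in a thread pool, then concatenate."""
    combos = itertools.permutations(items, 2)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order the chunks were submitted.
        frames = list(pool.map(process_chunk, chunked(combos, chunk_size)))
    # A single concat at the end avoids growing a DataFrame row by row.
    return pd.concat(frames, ignore_index=True)


rdf = build_results('abcdefghi')
```

If the real work is CPU-bound pure Python, swapping in concurrent.futures.ProcessPoolExecutor is one option, assuming the worker function is defined at module level so it can be pickled and the entry point is guarded with if __name__ == "__main__": on platforms that spawn new processes.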
