Parallelizing or otherwise speeding up calculation within generator and pandas dataframe
I am doing a calculation on permutations of things from a generator created by itertools. I have a piece of code in this form (this is a dummy example):
import itertools
import pandas as pd
combos = itertools.permutations('abcdefghi', 2)
results = []
i = 0
for combo in combos:
    i += 1  # this line is actually other stuff that's expensive
    results.append([combo[0] + '-' + combo[1], i])
rdf = pd.DataFrame(results, columns=['combo', 'value'])
Except in the real code, instead of i+=1 I am opening files and getting results of clf.predict, where clf is a classifier trained in scikit-learn. In place of i, I'm storing a value from that prediction. I think the combo[0]+'-'+combo[1] part is trivial though.
This takes too long. What should I do to make it faster? Such as:
1) writing better code (maybe I should initialize results with the proper length instead of using append, but how much will that help? And what's the best way to do that when I don't know the length before iterating through combos?)
2) initializing a pandas dataframe instead of a list and using apply?
3) using cython in pandas? I'm a total newbie to this.
4) parallelizing? I think I probably need to do this, but again, I'm a total newbie, and I don't know whether it's better to do it within a list or a pandas dataframe. I understand I would need to iterate over the generator and initialize some kind of container before parallelizing.
Which combination of these options would be best, and how can I put it together?
The append operation in pandas and the explicit for loop are slow. This code avoids both.
import itertools
import pandas as pd
combos = itertools.permutations('abcdefghi', 2)
# enumerate(combos, 1) yields (counter, permutation) pairs, counting from 1
combo_values = [('-'.join(x[1]), x[0]) for x in enumerate(combos, 1)]
rdf = pd.DataFrame({'combos': [x[0] for x in combo_values],
                    'value': [x[1] for x in combo_values]})
You can do this for each file and dataframe that you have, then use pd.concat to quickly combine the results. You can also add the enumeration of the permutations afterward if you want.
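As a rough sketch of the pd.concat idea, the permutations could be split into chunks, one dataframe built per chunk, and the pieces concatenated at the end. Everything named here is an assumption for illustration: expensive_value is a hypothetical stand-in for the real file reading and clf.predict call, frame_for_chunk and chunk_size are made-up names, and the thread pool is only one possible way to run the chunks concurrently. Whether threads actually help depends on whether the expensive step releases the GIL; a process pool may be the better fit if it does not.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def expensive_value(combo):
    # Hypothetical stand-in for the real work (opening files, clf.predict).
    return len(combo[0]) + len(combo[1])


def frame_for_chunk(chunk):
    # Build one small dataframe per chunk of permutations.
    return pd.DataFrame({
        'combo': ['-'.join(c) for c in chunk],
        'value': [expensive_value(c) for c in chunk],
    })


# Materialize the generator so it can be sliced into chunks.
combos = list(itertools.permutations('abcdefghi', 2))
chunk_size = 18  # arbitrary choice for this sketch; tune for the real workload
chunks = [combos[i:i + chunk_size] for i in range(0, len(combos), chunk_size)]

# executor.map returns results in input order, so rows keep permutation order.
with ThreadPoolExecutor(max_workers=4) as executor:
    frames = list(executor.map(frame_for_chunk, chunks))

# Combine all per-chunk dataframes in a single step.
rdf = pd.concat(frames, ignore_index=True)
```

Concatenating once at the end avoids growing a dataframe row by row, which is the slow pattern the answer warns about.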