
Improve the speed of a for loop in Python

I have a function that returns a dictionary. The function works by calculating values based on an array in a dataframe.

The dataframe has about 1,000,000 rows and looks like this:

                  col1                  
row1         [2, 3, 44, 89.6,...]           
row2         [10, 4, 33.3, 1.11,...]
row3         [3, 4, 3, 2.6, 5.9, 8, 10,...]  

My function takes the array in each row, does some calculations, and returns a dictionary based on these calculations. However, it is very slow. I appreciate that there is a lot of data to sift through, but is there a way to improve the speed?

The issues: the dataframe is long, and each array can contain 100+ values. Ranges from about 10-80.

My code looks like this:

list1 = []

for i in df.itertuples():
    list1.append(list(function(i.data).values()))

The idea here is that I loop through each row in 'df', apply my function to the 'data' column, and append the results to a list, 'list1'.

Function Explained

My function computes some pretty basic stuff. It takes an array as a parameter and calculates values based on that array, e.g. its length, its average value, and its min and max. I compute 8 values and store them in a dictionary. The last thing my function does is look at these computed values and add a final key to the dictionary in the form of a boolean.
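A minimal sketch of what such a function might look like, assuming only the statistics named above. The name `summarize`, the dictionary keys, and the `threshold` rule behind the final boolean are hypothetical, since the question does not show the real function; only four of the eight values are named in the post, so only those are computed here.

```python
from statistics import mean


def summarize(values, threshold=50.0):
    """Return basic statistics for one row's array (hypothetical stand-in).

    The keys and the `threshold` rule are assumptions for illustration;
    the original function is not shown in the question.
    """
    stats = {
        "length": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }
    # Final key: a boolean derived from the values computed above.
    stats["flag"] = stats["max"] > threshold
    return stats


row = [2, 3, 44, 89.6]
print(summarize(row))
```

Each call does a handful of passes over a short list, so for a function of this shape the per-row cost is small and the Python-level loop over a million rows is likely to dominate the running time.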

Like I said in the comments, if your function is costly (that is, reducing each row is the time-consuming part of your code), then a first step is to use multiprocessing, because it's easy to test.

Here is something you could try:

import time
from multiprocessing import Pool

def f(x):
  time.sleep(10e-6)  # Faking complex computation (10 microseconds)
  return x

def seq_test(input_array):
  return list(map(f, input_array))

def par_test(input_array):
  # 8 workers; check your core count with "nproc --all" or "sysctl -n hw.ncpu" on osx
  with Pool(8) as pool:  # the context manager shuts the workers down afterwards
    return pool.map(f, input_array)

def run_test(test_function):
  test_size = 10**5
  test_input = list(range(test_size))

  t0 = time.time()
  result = test_function(test_input)
  t1 = time.time()

  print(f"{test_function.__name__}: {t1-t0:.3f}s")

# The guard is needed on platforms that spawn worker processes (Windows, macOS)
if __name__ == "__main__":
  run_test(seq_test)
  run_test(par_test)

On my machine the parallel version runs about 7 times faster (quite close to the factor of 8 we could hope for):

seq_test: 2.131s
par_test: 0.300s
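To connect this back to the question, the same pattern can be applied to the list of arrays taken from the dataframe column. The sketch below is an illustration under assumptions: `row_stats` is a hypothetical stand-in for the real function, `rows` stands in for something like `df['data'].tolist()`, and the `chunksize` value is a guess that would need tuning on real data.

```python
from multiprocessing import Pool


def row_stats(values):
    """Hypothetical stand-in for the real per-row function."""
    return [len(values), sum(values) / len(values), min(values), max(values)]


def process_rows(rows, processes=None, chunksize=1000):
    """Apply row_stats to every row, in parallel unless processes == 1."""
    if processes == 1:
        return [row_stats(r) for r in rows]
    # chunksize batches rows per task so that inter-process overhead
    # is paid once per batch rather than once per row.
    with Pool(processes) as pool:
        return pool.map(row_stats, rows, chunksize=chunksize)


if __name__ == "__main__":
    # Stand-in for df["data"].tolist(); plain lists are simple to pickle.
    rows = [[2, 3, 44, 89.6], [10, 4, 33.3, 1.11], [3, 4, 3, 2.6, 5.9, 8, 10]]
    list1 = process_rows(rows)
    print(list1)
```

Whether this helps depends on the per-row cost: every row has to be pickled and sent to a worker, so for a very cheap function the transfer overhead can outweigh the gain, and timing both paths on a sample of the data is the safest way to decide.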

If that's not enough, the next step is to write function f in a different language; once again, the simplest option here seems to be Cython. But to discuss that, we need to see what's inside your function.

I suggest changing the format of your data like this:

print (df)
                            col1
row1            [2, 3, 44, 89.6]
row2         [10, 4, 33.3, 1.11]
row3  [3, 4, 3, 2.6, 5.9, 8, 10]

import pandas as pd
from itertools import chain

df = pd.DataFrame({
    'idx' : df.index.repeat(df['col1'].str.len()),
    'col1' : list(chain.from_iterable(df['col1'].tolist()))
})
print (df)
     idx   col1
0   row1   2.00
1   row1   3.00
2   row1  44.00
3   row1  89.60
4   row2  10.00
5   row2   4.00
6   row2  33.30
7   row2   1.11
8   row3   3.00
9   row3   4.00
10  row3   3.00
11  row3   2.60
12  row3   5.90
13  row3   8.00
14  row3  10.00

And then aggregate your data:

df1 = df.groupby('idx')['col1'].agg(['sum','mean','max','min'])
print (df1)
         sum       mean   max   min
idx                                
row1  138.60  34.650000  89.6  2.00
row2   48.41  12.102500  33.3  1.11
row3   36.50   5.214286  10.0  2.60
