How could I optimise this Python loop?

I am running this code on a large csv file (1.5 million rows). Is there a way to optimise it?

df is a pandas dataframe. I take a row and want to know which of these happens first in the 1000 following rows:

I find my value + 0.0004, or I find my value - 0.0004.

result = []
for row in range(len(df) - 1000):
    # df.get_value was removed in pandas 1.0; df.at is the scalar accessor
    start = df.at[row, 'A']
    win = start + 0.0004
    lose = start - 0.0004
    for n in range(1000):
        ref = df.at[row + n, 'B']
        if ref > win:
            result.append(1)
            break
        elif ref <= lose:
            result.append(-1)
            break
        elif n == 999:
            # neither level was reached within 1000 rows
            result.append(0)

The dataframe looks like this:

         timestamp           A         B
0   20190401 00:00:00.127  1.12230  1.12236
1   20190401 00:00:00.395  1.12230  1.12237
2   20190401 00:00:00.533  1.12229  1.12234
3   20190401 00:00:00.631  1.12228  1.12233
4   20190401 00:00:01.019  1.12230  1.12234
5   20190401 00:00:01.169  1.12231  1.12236 

The result is: result = [0,0,1,0,0,1,-1,1,…]

This works, but it takes a long time to process such large files.

To generate values for the "first outlier", define the following function:

import numpy as np

def firstOutlier(row, dltRow = 4, dltVal = 0.1):
    ''' Find the value for the first "outlier". Parameters:
    row    - the current row
    dltRow - number of rows to check, starting from the current
    dltVal - delta in value of "B", compared to "A" in the current row
    '''
    rowInd = row.name                        # Index of the current row
    df2 = df.iloc[rowInd : rowInd + dltRow]  # "dltRow" rows from the current
    outliers = df2[abs(df2.B - row.A) >= dltVal]
    if outliers.index.size == 0:  # No outliers within the range of rows
        return 0
    # +1 if the first outlier is above A, -1 if it is below
    return int(np.sign(outliers.iloc[0].B - row.A))

Then apply it to each row:

df.apply(firstOutlier, axis=1)

This function relies on the DataFrame having an index of consecutive integers starting from 0. Given ind, the index of any row, that row can be accessed with df.iloc[ind], and a slice of n rows starting from it with df.iloc[ind : ind + n].
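If the index is not already 0, 1, 2, ... (for example after filtering rows), one way to restore that form is reset_index. A minimal sketch, using a small made-up frame rather than the data from the question:

```python
import pandas as pd

# Hypothetical frame whose index is not consecutive from 0
df = pd.DataFrame(
    {'A': [1.0, 1.0, 1.0], 'B': [1.0, 0.8, 1.2]},
    index=[10, 20, 30],
)

# drop=True discards the old index instead of keeping it as a column
df = df.reset_index(drop=True)
print(df.index.tolist())
```

After this, row.name inside the applied function is the same number as the row's position, which is what the iloc slicing assumes.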

For my test, I set the default values of the parameters to:

  • dltRow = 4 - look at 4 rows, starting from the current one,
  • dltVal = 0.1 - look for rows whose B value is 0.1 or more away from A in the current row.

My test DataFrame was:

      A     B
0  1.00  1.00
1  0.99  1.00
2  1.00  0.80
3  1.00  1.05
4  1.00  1.20
5  1.00  1.00
6  1.00  0.80
7  1.00  1.00
8  1.00  1.00

The result (for my data and the default parameter values) was:

0   -1
1   -1
2   -1
3    1
4    1
5   -1
6   -1
7    0
8    0
dtype: int64
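The test above can be reproduced with a short script. This is a sketch that rebuilds the test DataFrame by hand and repeats the function with its threshold parameter spelled dltVal throughout. The second call, which passes dltRow as a keyword argument through apply instead of editing the defaults, is an addition for illustration.

```python
import numpy as np
import pandas as pd

# Test data copied from the table above
df = pd.DataFrame({
    'A': [1.00, 0.99, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00],
    'B': [1.00, 1.00, 0.80, 1.05, 1.20, 1.00, 0.80, 1.00, 1.00],
})

def firstOutlier(row, dltRow=4, dltVal=0.1):
    rowInd = row.name                        # index of the current row
    df2 = df.iloc[rowInd : rowInd + dltRow]  # window starting at this row
    outliers = df2[abs(df2.B - row.A) >= dltVal]
    if outliers.index.size == 0:             # nothing far enough from A
        return 0
    return int(np.sign(outliers.iloc[0].B - row.A))

res = df.apply(firstOutlier, axis=1)
print(res)

# Extra keyword arguments to apply are forwarded to the function,
# so the window size can be changed per call (illustrative value)
res_short = df.apply(firstOutlier, axis=1, dltRow=2)
print(res_short)
```

Note that this version treats both directions symmetrically (>= on the absolute difference), whereas the loop in the question uses > for the upper level and <= for the lower one.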

For your needs, change the default values of the parameters to 1000 and 0.0004 respectively.

The idea is to loop through A and B while maintaining a sorted list of A values. Then, for each B, find the highest A that loses and the lowest A that wins. Since the list is sorted, each search is O(log(n)). Only those A's whose index is within the last 1000 rows are used for setting the result vector. After that, the A's that are no longer waiting for a B are removed from the sorted list to keep it small.

import numpy as np
import bisect
import time

N = 10
M = 3
#N=int(1e6)
#M=int(1e3)
thresh = 0.4

A = np.random.rand(N)
B = np.random.rand(N)
result = np.zeros(N)

l = []

t_start = time.time()

for i in range(N):
    a = (A[i],i)
    bisect.insort(l,a)
    b = B[i]
    firstLoseInd = bisect.bisect_left(l,(b+thresh,-1))
    lastWinInd = bisect.bisect_right(l,(b-thresh,-1))
    for j in range(lastWinInd):
        curInd = l[j][1]
        if curInd > i-M:
            result[curInd] = 1
    for j in range(firstLoseInd,len(l)):
        curInd = l[j][1]
        if curInd > i-M:
            result[curInd] = -1
    del l[firstLoseInd:]
    del l[:lastWinInd]

t_done = time.time()

print(A)
print(B)
print(result)
print(t_done - t_start)

This is a sample output:

[ 0.22643589  0.96092354  0.30098532  0.15569044  0.88474775  0.25458535
  0.78248271  0.07530432  0.3460113   0.0785128 ]
[ 0.83610433  0.33384085  0.51055061  0.54209458  0.13556121  0.61257179
  0.51273686  0.54850825  0.24302884  0.68037965]
[ 1. -1.  0.  1. -1.  0. -1.  1.  0.  1.]

For N = int(1e6) and M = int(1e3) it took about 3.4 seconds on my computer.
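One way to sanity-check the sorted-list approach is to compare it with a direct double loop on a small random input. The sketch below wraps the loop above in a function and adds a brute-force reference. The names first_hit_sorted and first_hit_brute, the seed and the sizes are all made up for illustration. Both functions score every row, including the last M rows, whose windows are simply shorter, so they differ from the question's loop, which stops at len(df) - 1000.

```python
import bisect
import numpy as np

def first_hit_sorted(A, B, thresh, M):
    """Sorted-list approach from the code above, wrapped in a function."""
    result = np.zeros(len(A))
    waiting = []  # sorted (A value, row index) pairs that have no hit yet
    for i in range(len(A)):
        bisect.insort(waiting, (A[i], i))
        b = B[i]
        # entries from lose_from onward have A >= b + thresh (they lose)
        lose_from = bisect.bisect_left(waiting, (b + thresh, -1))
        # entries before win_upto have A < b - thresh (they win)
        win_upto = bisect.bisect_right(waiting, (b - thresh, -1))
        for _, k in waiting[:win_upto]:
            if k > i - M:           # only rows still inside their window
                result[k] = 1
        for _, k in waiting[lose_from:]:
            if k > i - M:
                result[k] = -1
        del waiting[lose_from:]     # delete the tail first so the head
        del waiting[:win_upto]      # indices stay valid
    return result

def first_hit_brute(A, B, thresh, M):
    """Reference: scan up to M rows of B for each row of A."""
    result = np.zeros(len(A))
    for k in range(len(A)):
        for i in range(k, min(k + M, len(A))):
            if A[k] < B[i] - thresh:     # B is more than thresh above A[k]
                result[k] = 1
                break
            if A[k] >= B[i] + thresh:    # B is at least thresh below A[k]
                result[k] = -1
                break
    return result

rng = np.random.default_rng(0)  # arbitrary seed
A = rng.random(500)
B = rng.random(500)
same = np.array_equal(first_hit_sorted(A, B, 0.4, 10),
                      first_hit_brute(A, B, 0.4, 10))
print(same)
```

The brute-force comparisons are written with the same arithmetic as the bisect keys (b - thresh and b + thresh) so that floating-point rounding cannot make the two versions disagree.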
