How could I optimise this Python loop?
I am running this code on a large CSV file (1.5 million rows). Is there a way to optimise it?
df is a pandas DataFrame. I take a row and want to know what happens first in the 1000 following rows:
do I find my value + 0.0004, or do I find my value - 0.0004?
result = []
for row in range(len(df) - 1000):
    start = df.get_value(row, 'A')
    win = start + 0.0004
    lose = start - 0.0004
    for n in range(1000):
        ref = df.get_value(row + n, 'B')
        if ref > win:
            result.append(1)
            break
        elif ref <= lose:
            result.append(-1)
            break
        elif n == 999:
            result.append(0)
The DataFrame looks like:
timestamp A B
0 20190401 00:00:00.127 1.12230 1.12236
1 20190401 00:00:00.395 1.12230 1.12237
2 20190401 00:00:00.533 1.12229 1.12234
3 20190401 00:00:00.631 1.12228 1.12233
4 20190401 00:00:01.019 1.12230 1.12234
5 20190401 00:00:01.169 1.12231 1.12236
The result is: result = [0, 0, 1, 0, 0, 1, -1, 1, ...]
This works, but it takes a long time to process such large files.
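A minimal runnable version of the loop above (a sketch: the sample values are made up, M stands in for the 1000-row window, and df.at replaces df.get_value, which was deprecated and then removed in pandas 1.0):

```python
import pandas as pd

# Made-up sample data in the shape of the question's DataFrame
df = pd.DataFrame({
    'A': [1.12230, 1.12230, 1.12229, 1.12228, 1.12230, 1.12231],
    'B': [1.12236, 1.12237, 1.12234, 1.12233, 1.12180, 1.12236],
})

M = 3  # look-ahead window (1000 in the real data)
result = []
for row in range(len(df) - M):
    start = df.at[row, 'A']        # df.at replaces the removed df.get_value
    win = start + 0.0004
    lose = start - 0.0004
    for n in range(M):
        ref = df.at[row + n, 'B']
        if ref > win:              # B rose 0.0004 above A first
            result.append(1)
            break
        elif ref <= lose:          # B fell 0.0004 below A first
            result.append(-1)
            break
        elif n == M - 1:           # neither happened within the window
            result.append(0)

print(result)  # [0, 0, -1]
```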
To generate values for the "first outlier", define the following function:
def firstOutlier(row, dltRow=4, dltVal=0.1):
    ''' Find the value for the first "outlier". Parameters:
    row - the current row
    dltRow - number of rows to check, starting from the current
    dltVal - delta in value of "B", compared to "A" in the current row
    '''
    rowInd = row.name                           # Index of the current row
    df2 = df.iloc[rowInd : rowInd + dltRow]     # "dltRow" rows from the current
    outliers = df2[abs(df2.B - row.A) >= dltVal]
    if outliers.index.size == 0:                # No outliers within the range of rows
        return 0
    return int(np.sign(outliers.iloc[0].B - row.A))
Then apply it to each row:
df.apply(firstOutlier, axis=1)
This function relies on the fact that the DataFrame has an index consisting of consecutive numbers starting from 0, so that given ind, the index of any row, we can access that row by calling df.iloc[ind], and a slice of n rows starting from it by calling df.iloc[ind : ind + n].
For my test, I set the default values of the parameters to:
dltRow = 4 - look at 4 rows, starting from the current one,
dltVal = 0.1 - look for rows with the B column "distant by" 0.1 or more from A in the current row.
My test DataFrame was:
A B
0 1.00 1.00
1 0.99 1.00
2 1.00 0.80
3 1.00 1.05
4 1.00 1.20
5 1.00 1.00
6 1.00 0.80
7 1.00 1.00
8 1.00 1.00
The result (for my data and the default parameter values) was:
0 -1
1 -1
2 -1
3 1
4 1
5 -1
6 -1
7 0
8 0
dtype: int64
For your needs, change the default values of the parameters to 1000 and 0.0004 respectively.
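Putting the answer together into a self-contained script (a sketch: the construction of df below is mine, reproducing the test data shown above):

```python
import numpy as np
import pandas as pd

# Test data from the answer above
df = pd.DataFrame({
    'A': [1.00, 0.99, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00],
    'B': [1.00, 1.00, 0.80, 1.05, 1.20, 1.00, 0.80, 1.00, 1.00],
})

def firstOutlier(row, dltRow=4, dltVal=0.1):
    '''Return the sign of the first "outlier" in the next dltRow rows, else 0.'''
    rowInd = row.name                          # index of the current row
    df2 = df.iloc[rowInd : rowInd + dltRow]    # dltRow rows from the current one
    outliers = df2[abs(df2.B - row.A) >= dltVal]
    if outliers.index.size == 0:               # no outliers within the range
        return 0
    return int(np.sign(outliers.iloc[0].B - row.A))

res = df.apply(firstOutlier, axis=1)
print(res.tolist())  # [-1, -1, -1, 1, 1, -1, -1, 0, 0]
```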
The idea is to loop through A and B while maintaining a sorted list of A values. Then, for each B, find the highest A that loses and the lowest A that wins. Since it's a sorted list, it's O(log(n)) to search. Only those A's whose index is within the last 1000 rows are used for setting the result vector. After that, the A's that are no longer waiting for a B are removed from this sorted list to keep it small.
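The bookkeeping relies on Python's lexicographic tuple ordering: the list holds (value, index) pairs, and probing with an index of -1 makes the probe sort before any real entry carrying the same value. A quick illustration of the two search calls, with made-up numbers:

```python
import bisect

l = []  # sorted list of (A value, row index) pairs
for pair in [(0.5, 0), (0.2, 1), (0.8, 2)]:
    bisect.insort(l, pair)
# l is now [(0.2, 1), (0.5, 0), (0.8, 2)]

b, thresh = 0.55, 0.3
# first entry whose A value is >= b + thresh: these rows lose (-1)
firstLoseInd = bisect.bisect_left(l, (b + thresh, -1))
# entries whose A value is < b - thresh: these rows win (+1)
lastWinInd = bisect.bisect_right(l, (b - thresh, -1))
print(firstLoseInd, lastWinInd)  # 3 1
```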
import numpy as np
import bisect
import time

N = 10
M = 3
# N = int(1e6)
# M = int(1e3)
thresh = 0.4
A = np.random.rand(N)
B = np.random.rand(N)
result = np.zeros(N)
l = []

t_start = time.time()
for i in range(N):
    a = (A[i], i)
    bisect.insort(l, a)
    b = B[i]
    firstLoseInd = bisect.bisect_left(l, (b + thresh, -1))
    lastWinInd = bisect.bisect_right(l, (b - thresh, -1))
    for j in range(lastWinInd):
        curInd = l[j][1]
        if curInd > i - M:
            result[curInd] = 1
    for j in range(firstLoseInd, len(l)):
        curInd = l[j][1]
        if curInd > i - M:
            result[curInd] = -1
    del l[firstLoseInd:]
    del l[:lastWinInd]
t_done = time.time()

print(A)
print(B)
print(result)
print(t_done - t_start)
This is a sample output:
[ 0.22643589 0.96092354 0.30098532 0.15569044 0.88474775 0.25458535
0.78248271 0.07530432 0.3460113 0.0785128 ]
[ 0.83610433 0.33384085 0.51055061 0.54209458 0.13556121 0.61257179
0.51273686 0.54850825 0.24302884 0.68037965]
[ 1. -1. 0. 1. -1. 0. -1. 1. 0. 1.]
For N = int(1e6) and M = int(1e3) it took about 3.4 seconds on my computer.
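The loop can be wrapped into a reusable function taking the two columns as NumPy arrays (a sketch: the name first_outcome and the deterministic test values are mine; for the question's data you would call it as first_outcome(df['A'].to_numpy(), df['B'].to_numpy())):

```python
import bisect
import numpy as np

def first_outcome(A, B, M=1000, thresh=0.0004):
    """For each row i: 1 if B rises above A[i] + thresh first, -1 if B falls
    to A[i] - thresh or below first, within the next M rows; otherwise 0."""
    N = len(A)
    result = np.zeros(N, dtype=int)
    pending = []  # sorted (A value, row index) pairs still awaiting an outcome
    for i in range(N):
        bisect.insort(pending, (A[i], i))
        b = B[i]
        firstLoseInd = bisect.bisect_left(pending, (b + thresh, -1))
        lastWinInd = bisect.bisect_right(pending, (b - thresh, -1))
        for value, ind in pending[:lastWinInd]:    # these rows win...
            if ind > i - M:                        # ...if still inside the window
                result[ind] = 1
        for value, ind in pending[firstLoseInd:]:  # these rows lose
            if ind > i - M:
                result[ind] = -1
        del pending[firstLoseInd:]                 # drop resolved rows
        del pending[:lastWinInd]
    return result

A = np.array([1.0, 1.0, 1.0])
B = np.array([1.0, 1.5, 0.5])
print(first_outcome(A, B, M=2, thresh=0.4))  # [ 1  1 -1]
```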