How can I optimize this Python loop?

Question

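I am running this code on a large CSV file (1.5 million rows). Is there a way to optimize it?

df is a pandas DataFrame. I take a row and want to know what happens within the next 1000 rows:

whether I first find my value + 0.0004 or my value - 0.0004.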

result = []
for row in range(len(df) - 1000):
    # get_value() was removed in pandas 1.0; .at is the label-based scalar accessor
    start = df.at[row, 'A']
    win = start + 0.0004
    lose = start - 0.0004
    for n in range(1000):
        ref = df.at[row + n, 'B']
        if ref > win:
            result.append(1)
            break
        elif ref <= lose:
            result.append(-1)
            break
        elif n == 999:
            result.append(0)

The DataFrame looks like this:

         timestamp           A         B
0   20190401 00:00:00.127  1.12230  1.12236
1   20190401 00:00:00.395  1.12230  1.12237
2   20190401 00:00:00.533  1.12229  1.12234
3   20190401 00:00:00.631  1.12228  1.12233
4   20190401 00:00:01.019  1.12230  1.12234
5   20190401 00:00:01.169  1.12231  1.12236

The result is: result = [0, 0, 1, 0, 0, 1, -1, 1, ...]

This works, but it takes a very long time to process such a large file.

Answer 1

To generate the value for the "first outlier", define the following function:

import numpy as np

def firstOutlier(row, dltRow = 4, dltVal = 0.1):
    ''' Find the value for the first "outlier". Parameters:
    row    - the current row
    dltRow - number of rows to check, starting from the current
    dltVal - delta in value of "B", compared to "A" in the current row
    '''
    rowInd = row.name                        # Index of the current row
    df2 = df.iloc[rowInd : rowInd + dltRow]  # "dltRow" rows from the current (df is the global DataFrame)
    outliers = df2[abs(df2.B - row.A) >= dltVal]
    if outliers.index.size == 0:  # No outliers within the range of rows
        return 0
    return int(np.sign(outliers.iloc[0].B - row.A))

Then apply it to each row:

df.apply(firstOutlier, axis=1)

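This function relies on the DataFrame's index consisting of consecutive integers starting at 0. Given ind, the index of any row, we can therefore access that row with df.iloc[ind] and take a slice of n rows starting at that row with df.iloc[ind : ind + n].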
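To make the indexing assumption concrete, here is a minimal sketch. The three-row frame `demo` is hypothetical and not part of the answer; it only shows that with the default 0..n-1 index, `row.name` equals the row's position, and that an `iloc` slice running past the end simply returns fewer rows.

```python
import pandas as pd

# Hypothetical three-row frame with the default 0..n-1 index.
demo = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})

row = demo.iloc[1]                  # a Series; its .name is the row's index label
ind = row.name                      # equals the position because the index is 0..n-1
window = demo.iloc[ind : ind + 2]   # two rows, starting at the current one
tail = demo.iloc[2 : 2 + 2]         # slicing past the end just yields fewer rows

print(ind, list(window.B), list(tail.B))
```

If the index is not of this form (for example after filtering), calling `df.reset_index(drop=True)` first restores it.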

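For my test, I set the parameter defaults to:

dltRow = 4 - look at 4 rows, starting from the current one,
dltVal = 0.1 - look for rows where column B is 0.1 or more "away" from A in the current row.

My test DataFrame is: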

      A     B
0  1.00  1.00
1  0.99  1.00
2  1.00  0.80
3  1.00  1.05
4  1.00  1.20
5  1.00  1.00
6  1.00  0.80
7  1.00  1.00
8  1.00  1.00

The result (for my data and the default parameter values) is:

0   -1
1   -1
2   -1
3    1
4    1
5   -1
6   -1
7    0
8    0
dtype: int64

To suit your needs, change the parameter defaults to 1000 and 0.0004 respectively.
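As an alternative to editing the defaults, the values can be passed at call time, since `DataFrame.apply` forwards extra keyword arguments to the applied function. The sketch below repeats the function on the answer's test data so that it runs on its own. Note that this function uses `>=` in both directions, whereas the question's loop uses `>` for the upper bound, so results can differ for values exactly on the threshold.

```python
import numpy as np
import pandas as pd

# The answer's test data.
df = pd.DataFrame({
    "A": [1.00, 0.99, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00],
    "B": [1.00, 1.00, 0.80, 1.05, 1.20, 1.00, 0.80, 1.00, 1.00],
})

def firstOutlier(row, dltRow=4, dltVal=0.1):
    rowInd = row.name                        # index of the current row
    df2 = df.iloc[rowInd : rowInd + dltRow]  # window starting at the current row
    outliers = df2[abs(df2.B - row.A) >= dltVal]
    if outliers.index.size == 0:
        return 0
    return int(np.sign(outliers.iloc[0].B - row.A))

# Extra keyword arguments to apply() are forwarded to firstOutlier.
defaults = df.apply(firstOutlier, axis=1)
custom = df.apply(firstOutlier, axis=1, dltRow=1000, dltVal=0.0004)
print(custom.tolist())
```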

Answer 2

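The idea is to iterate over A and B while maintaining a sorted list of A values. Then, for each B, find the lowest A that loses and the highest A that wins. Because the list is sorted, each search takes O(log(n)). Only those A whose index lies within the last 1000 rows are used to set the result vector. Afterwards, the A values that are no longer waiting for a B are removed from the sorted list to keep it small.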
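Before the full listing, a small illustration of the two `bisect` lookups may help. The values, the threshold, and the names `pending`, `winners` and `losers` are made up for this sketch. Each list entry is an `(A value, row index)` pair; because real row indices are never negative, a probe of the form `(x, -1)` sorts before every entry whose value is `x`.

```python
import bisect

# Sorted list of (A value, row index) pairs.
pending = []
for idx, a in enumerate([0.50, 0.10, 0.90, 0.30]):
    bisect.insort(pending, (a, idx))   # insert while keeping the list sorted

b, thresh = 0.55, 0.2                  # hypothetical B value and threshold

# Entries with A < b - thresh: B has risen more than thresh above them ("win").
last_win = bisect.bisect_right(pending, (b - thresh, -1))
# Entries with A >= b + thresh: B is at least thresh below them ("lose").
first_lose = bisect.bisect_left(pending, (b + thresh, -1))

winners = [i for _, i in pending[:last_win]]
losers = [i for _, i in pending[first_lose:]]
print(winners, losers)
```

The winners always form a prefix of the sorted list and the losers a suffix, which is why both groups can be removed with two slice deletions.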

import numpy as np
import bisect
import time

N = 10
M = 3
#N=int(1e6)
#M=int(1e3)
thresh = 0.4

A = np.random.rand(N)
B = np.random.rand(N)
result = np.zeros(N)

l = []

t_start = time.time()

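for i in range(N):
    # Add the current A, tagged with its row index, keeping l sorted.
    a = (A[i],i)
    bisect.insort(l,a)
    b = B[i]
    # Entries from firstLoseInd on have A >= b + thresh (lose);
    # entries before lastWinInd have A < b - thresh (win).
    firstLoseInd = bisect.bisect_left(l,(b+thresh,-1))
    lastWinInd = bisect.bisect_right(l,(b-thresh,-1))
    for j in range(lastWinInd):
        curInd = l[j][1]
        if curInd > i-M:    # only rows whose M-row window still includes i
            result[curInd] = 1
    for j in range(firstLoseInd,len(l)):
        curInd = l[j][1]
        if curInd > i-M:
            result[curInd] = -1
    # Drop every entry that has just crossed a threshold.
    del l[firstLoseInd:]
    del l[:lastWinInd]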

t_done = time.time()

print(A)
print(B)
print(result)
print(t_done - t_start)

This is a sample output:

[ 0.22643589  0.96092354  0.30098532  0.15569044  0.88474775  0.25458535
  0.78248271  0.07530432  0.3460113   0.0785128 ]
[ 0.83610433  0.33384085  0.51055061  0.54209458  0.13556121  0.61257179
  0.51273686  0.54850825  0.24302884  0.68037965]
[ 1. -1.  0.  1. -1.  0. -1.  1.  0.  1.]

For N = int(1e6) and M = int(1e3), this took about 3.4 seconds on my computer.
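To use the same idea on real columns, the loop can be wrapped in a function that takes the two columns as arrays. The following is a sketch under that assumption; the names `first_hit_sorted` and `first_hit_naive` are my own, and the naive double loop is included only to cross-check the result on small random input. Unlike the question's loop, both functions also produce a value for the last rows, whose windows are shorter than M.

```python
import bisect
import numpy as np

def first_hit_sorted(A, B, M, thresh):
    """Sorted-list approach: +1 / -1 / 0 per row, looking at B[i : i + M]."""
    result = np.zeros(len(A), dtype=int)
    pending = []                          # sorted (A value, row index) pairs
    for i in range(len(A)):
        bisect.insort(pending, (A[i], i))
        b = B[i]
        first_lose = bisect.bisect_left(pending, (b + thresh, -1))
        last_win = bisect.bisect_right(pending, (b - thresh, -1))
        for _, k in pending[:last_win]:
            if k > i - M:                 # row k's window still covers row i
                result[k] = 1
        for _, k in pending[first_lose:]:
            if k > i - M:
                result[k] = -1
        del pending[first_lose:]          # entries that crossed a threshold leave the list
        del pending[:last_win]
    return result

def first_hit_naive(A, B, M, thresh):
    """Straightforward double loop, used here only as a cross-check."""
    result = np.zeros(len(A), dtype=int)
    for i in range(len(A)):
        for b in B[i : i + M]:
            if A[i] < b - thresh:
                result[i] = 1
                break
            if A[i] >= b + thresh:
                result[i] = -1
                break
    return result

rng = np.random.default_rng(0)
A = rng.random(200)
B = rng.random(200)
fast = first_hit_sorted(A, B, M=3, thresh=0.4)
slow = first_hit_naive(A, B, M=3, thresh=0.4)
print(np.array_equal(fast, slow))
```

For the question's data, the call would look like `first_hit_sorted(df["A"].to_numpy(), df["B"].to_numpy(), M=1000, thresh=0.0004)`.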


Solution 1
0 Accepted 2019-11-16 21:17:30

Solution 2
-1 2019-11-16 20:50:02
