Python: loop over DataFrame rows until a condition is first met

Question

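I have a pandas DataFrame. I want to loop over its rows and compute a metric from the first row to the second; if the condition is not met there, check from the first row to the third, then the fourth, and so on, comparing the metric with another value each time. I want the row number at which the condition is first met. As a concrete example, for a DataFrame of length 30 the slices might be df.iloc[0:10], df.iloc[10:15], df.iloc[15:27], and df.iloc[27:30], where the values 10, 15, 27 are stored in a list.

An example DataFrame: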

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,100, size=(100, 1)), columns=list('A'))
df  
    A
0   5
1  11
2   8
3   1
4  16
5  24

some_value = 20 
mylist = []
for i in range(len(df)):
    for j in range(i+2, len(df)):
        # Metric calculated on the relevant rows
        metric = df.iloc[i:j]['A'].sum()
        if metric >= some_value:
            mylist.append(j)
            break

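The loop starts with df.iloc[0:2] and computes 5+11; since that is not greater than some_value (20), it moves on to df.iloc[0:3]. This time, since 5+11+8 is greater than some_value, I want to save this number (2) and not check df.iloc[0:4]. The loop should then start checking again from df.iloc[3:5] (1+16); since the condition is not met, it continues to df.iloc[3:6] (1+16+24), and so on, saving the position each time the condition is met.

The example output in this case is a list containing the values: [2, 5]

I wrote the code above, but it doesn't quite do what I want. Could you help with this? Thanks.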

Answer 1

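Currently your loop is O(n^2). But once a match is found for a given starting value i, your outer loop restarts from i+1, and you don't want to start there. You want to start from j. Here is a quick fix to your code.

I don't have numpy at the moment, so I'm using a Python list as the data.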

data = [5, 11, 8, 1, 16, 24]
some_value = 20 
mylist = []
j = 0
for i in range(len(data)):
    # can't change iteration so just skip ahead with continue
    if i < j:
        continue
    # range expects second argument to be past the end
    # dunno if df is the same, but probably?
    for j in range(i+1, len(data)+1):
        metric = sum(data[i:j])
        if metric >= some_value:
            mylist.append(j-1)
            break
print(mylist)

[2, 5]
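
The answer above uses a plain list and is unsure whether a DataFrame behaves the same way. A minimal sketch of the same skip-ahead idea on a DataFrame follows, assuming the six-row sample from the question; the `while`/`for ... else` structure is a substitution made here for illustration, not the answerer's code. `df.iloc[i:j]` excludes row j, just like list slicing.

```python
import pandas as pd

# Assumption: the six sample values shown in the question
df = pd.DataFrame({"A": [5, 11, 8, 1, 16, 24]})
some_value = 20

mylist = []
i = 0
while i < len(df):
    for j in range(i + 1, len(df) + 1):
        # Sum of rows i .. j-1 (the stop index j is exclusive)
        if df.iloc[i:j]["A"].sum() >= some_value:
            mylist.append(j - 1)
            break
    else:
        # No window starting at i reaches some_value, so stop
        break
    i = j  # resume right after the row that met the condition

print(mylist)  # [2, 5]
```

Re-summing every slice keeps this O(n^2) in the worst case, so it is only a direct port of the quick fix.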

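I'd suggest doing this in a single loop and keeping a running total (an accumulator). Here I rather like returning ranges, in case you want to slice the df: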

data = [5, 11, 8, 1, 16, 24]
threshold = 20

def accumulate_to_threshold(data, threshold):
    start = 0
    total = 0
    for index, item in enumerate(data):
        total += item
        if total > threshold:
            yield (start, index+1)
            total = 0
            start = index+1
    # leftovers below threshold here

for start, end in accumulate_to_threshold(data, threshold):
    sublist = data[start:end]
    print (sublist, "totals to", sum(sublist))

[5, 11, 8] totals to 24
[1, 16, 24] totals to 41

Of course, you could yield the index and get [2, 5] from the above, instead of yielding a range.
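
A minimal sketch of that index-yielding variant, keeping the same strict `>` comparison as the generator above. The name `first_hits` is made up for this example.

```python
def first_hits(data, threshold):
    """Yield each index at which the running total first exceeds threshold."""
    total = 0
    for index, item in enumerate(data):
        total += item
        if total > threshold:
            yield index
            total = 0  # reset the accumulator for the next run

data = [5, 11, 8, 1, 16, 24]
print(list(first_hits(data, 20)))  # [2, 5]
```

If the condition should match the question's `>=` instead, only the comparison operator needs to change.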

Answer 2

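My approach would be:

numpy.reshape(values, newshape, ...)
.sum(axis=1)
Boolean masking

I don't know whether this answers your question the way you wanted, but I'll show how I would approach it using the built-in vectorization of pandas/numpy. In short, loops are cumbersome (slow), so avoid them where possible: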

import pandas as pd
import numpy as np

# made it smaller
df = pd.DataFrame(np.random.randint(0,25, size=(20, 1)), columns=list('A'))

numpy.reshape() and sum()

We'll reshape column A, which places the values side by side, and then sum across axis=1:

Compare df with re_shaped below. Note how the values are rearranged.


re_shaped = np.reshape(df.A.values, (10, 2))
print(df)

     A
0    5
1   11
2    8
3   23
...
16   6
17  14
18   3
19   0

print(re_shaped)

array([[ 5, 11],
       [ 8, 23],
       ...
       [ 6, 14],
       [ 3,  0]])

summed = re_shaped.sum(axis=1)
print(summed)

array([16, 31, 15, 19, 13, 21, 28, 30, 20,  3])

Boolean masking

some_value = 20
greater_than_some_value = summed[summed >= some_value]
print(greater_than_some_value)

array([31, 21, 28, 30, 20])
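
The mask above returns the sums themselves, whereas the question asks for positions. A small sketch of recovering the positions with `np.flatnonzero`, reusing the `summed` values printed above. Note that each position refers to a fixed pair of rows in this answer, not to the variable-length windows in the question.

```python
import numpy as np

# Values copied from the printed `summed` array above
summed = np.array([16, 31, 15, 19, 13, 21, 28, 30, 20, 3])
some_value = 20

# Indices of the pairs whose sum meets the condition
positions = np.flatnonzero(summed >= some_value)
print(positions)  # [1 5 6 7 8]
```

Pair p covers rows 2*p and 2*p + 1 of the original df.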

There you have it. Hope it helps.

Answer 3

Have you considered using just one loop:

import pandas as pd
import numpy as np

n = int(1e6)
df = pd.DataFrame({"A": np.random.randint(100, size=n)})

threshold = 20
my_list = []
s = 0
for i, k in enumerate(df["A"].values):
    if s + k > threshold:
        my_list.append(i)
        s = 0
    else:
        s += k

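You could eventually use numba, but I think the best idea would be to compute a cumsum with reset on your df.

Numba

The previous loop can be written as a function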
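
The answer does not show what a cumsum with reset would look like. One possible sketch, assuming all values and the threshold are non-negative so that the cumulative sum is sorted: compute `np.cumsum` once, then use `np.searchsorted` to jump to the next position where the total since the last reset exceeds the threshold. The function name `reset_points` is made up here, and this is an illustration rather than the answerer's method.

```python
import numpy as np

def reset_points(values, threshold):
    """Positions where a running total, reset after each hit, exceeds threshold.

    Assumes non-negative values and threshold, so np.cumsum(values) is sorted.
    """
    cs = np.cumsum(values)
    hits = []
    offset = 0  # cumulative sum at the most recent reset
    while True:
        # First position whose cumulative sum is strictly above offset + threshold
        idx = int(np.searchsorted(cs, offset + threshold, side="right"))
        if idx >= len(cs):
            break
        hits.append(idx)
        offset = cs[idx]
    return hits

print(reset_points(np.array([5, 11, 8, 1, 16, 24]), 20))  # [2, 5]
```

The Python-level loop runs once per hit rather than once per row, which helps most when hits are sparse.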

def fun(vec, threshold=20):
    my_list = []
    s = 0
    for i, k in enumerate(vec):
        if s + k > threshold:
            my_list.append(i)
            s = 0
        else:
            s += k
    return my_list

We can use numba

from numba import jit

@jit(nopython=True, cache=True, nogil=True)
def fun_numba(vec, threshold=20):
    my_list = []
    s = 0
    for i, k in enumerate(vec):
        if s + k > threshold:
            my_list.append(i)
            s = 0
        else:
            s += k
    return my_list

%%timeit -n 5 -r 5
my_list = fun(df["A"].values)

606 ms ± 28 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

%%timeit -n 5 -r 5
my_list = fun_numba(df["A"].values)

59.6 ms ± 20.4 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

That is roughly a 10x speedup.
