Improving Python loop performance with Pandas DataFrames

Question

Consider the following DataFrame df:

timestamp    id        condition
             1234      A    
             2323      B
             3843      B
             1234      C
             8574      A
             9483      A

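Based on the values in the condition column, I have to define a new column in this DataFrame that counts how many ids are in that condition. Note, however, that because the DataFrame is ordered by the timestamp column, there can be multiple entries with the same id, so a simple .cumsum() is not a viable option.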

I have come up with the following code, which works correctly but is very slow:

import numpy as np

# I start by defining empty arrays
ids_with_condition_a = np.empty(0)
ids_with_condition_b = np.empty(0)
ids_with_condition_c = np.empty(0)

# Initializing new column
df['count'] = 0

# Using a for loop to do the task, but this is sooo slow!
for r in range(0, df.shape[0]):
    if df.condition[r] == 'A':
        ids_with_condition_a = np.append(ids_with_condition_a, df.id[r])
    elif df.condition[r] == 'B':
        ids_with_condition_b = np.append(ids_with_condition_b, df.id[r])
        # ids that moved to B no longer count towards A
        ids_with_condition_a = np.setdiff1d(ids_with_condition_a, ids_with_condition_b)
    elif df.condition[r] == 'C':
        ids_with_condition_c = np.append(ids_with_condition_c, df.id[r])

    # df.count is a DataFrame method, so write to the column through .loc
    df.loc[r, 'count'] = ids_with_condition_a.size

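Keeping these NumPy arrays is very useful to me, because they give the list of ids in a particular condition. I would also like to be able to put these arrays dynamically into the corresponding cells of the df DataFrame.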

Can you suggest a better solution in terms of performance?

Answer 1

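You need to use groupby on the 'condition' column together with cumcount to count how many ids there are in each condition up to the current row (which seems to be what your code does):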

df['count'] = df.groupby('condition').cumcount()+1 # +1 is to start at 1 not 0

With your input sample, you get:

     id condition  count
0  1234         A      1
1  2323         B      1
2  3843         B      2
3  1234         C      1
4  8574         A      2
5  9483         A      3

This is faster than using a for loop.
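To try the cumcount approach on its own, here is a minimal self-contained sketch. The DataFrame is rebuilt from the sample in the question, leaving out the timestamp column because its values are not shown. Keep in mind that cumcount numbers the rows inside each condition; it does not take an id out of one condition's tally when that id later shows up under another condition.

```python
import pandas as pd

# Sample data copied from the question (timestamp omitted: its values are not shown)
df = pd.DataFrame({
    "id": [1234, 2323, 3843, 1234, 8574, 9483],
    "condition": ["A", "B", "B", "C", "A", "A"],
})

# cumcount numbers the rows of each group from 0 in row order; +1 makes it start at 1
df["count"] = df.groupby("condition").cumcount() + 1
print(df)
```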

If you only want the rows with condition A, you can use a mask: for example, print(df[df['condition'] == 'A']) shows only the rows whose condition is equal to A. So to get an array:

arr_A = df.loc[df['condition'] == 'A','id'].values
print(arr_A)
[1234 8574 9483]

EDIT: to create two columns for each condition, you can do the following, for example for condition A:

# requires: import numpy as np (the old pd.np alias is gone from recent pandas)
# put 1 in a column where the condition is met, NaN elsewhere
df['nb_cond_A'] = np.where(df['condition'] == 'A', 1, np.nan)
# then use cumsum to increment the number, ffill to carry the same number down
# where the condition is not met, fillna(0) to fill the remaining missing values
df['nb_cond_A'] = df['nb_cond_A'].cumsum().ffill().fillna(0).astype(int)
# for the partial list, first create the full array
arr_A = df.loc[df['condition'] == 'A','id'].values
# create the column with apply (another way might exist, but this is one)
df['partial_arr_A'] = df['nb_cond_A'].apply(lambda x: arr_A[:x])

The output looks like this:

     id condition  nb_cond_A       partial_arr_A
0  1234         A          1              [1234]
1  2323         B          1              [1234]
2  3843         B          1              [1234]
3  1234         C          1              [1234]
4  8574         A          2        [1234, 8574]
5  9483         A          3  [1234, 8574, 9483]

The same then goes for B and C. A loop such as for cond in set(df['condition']) would probably do the job.
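The loop over conditions could be sketched as follows. This is only one way to write it, under a few assumptions: it takes a shortcut and runs cumsum directly on the boolean mask instead of the where/cumsum/ffill/fillna chain, it stores the partial arrays as plain lists, and the column names nb_cond_<X> and partial_arr_<X> simply extend the naming used for condition A.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": [1234, 2323, 3843, 1234, 8574, 9483],
    "condition": ["A", "B", "B", "C", "A", "A"],
})

for cond in df["condition"].unique():
    is_cond = df["condition"] == cond
    # running number of rows seen so far with this condition (0 before the first one)
    df[f"nb_cond_{cond}"] = is_cond.cumsum()
    # all ids of this condition, in row order
    ids = df.loc[is_cond, "id"].to_numpy()
    # per row, the ids of this condition seen up to and including that row
    df[f"partial_arr_{cond}"] = [ids[:n].tolist() for n in df[f"nb_cond_{cond}"]]

print(df)
```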

EDIT 2: one idea for doing what you described in the comments, though I am not sure it will improve performance:

# array of unique conditions
arr_cond = df.condition.unique()
# use apply to create, row by row, the list of ids for each condition
# (keep must be passed by keyword in recent pandas versions)
df[arr_cond] = (df.apply(lambda row: (df.loc[:row.name].drop_duplicates('id', keep='last')
                                          .groupby('condition').id.apply(list)), axis=1)
                  .applymap(lambda x: [] if not isinstance(x,list) else x))

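Some explanation: for each row, select the data up to and including that row with loc[:row.name], then drop the duplicated 'id' values while keeping the last occurrence, using drop_duplicates on 'id' with the 'last' option (in your example, this means that once we reach row 3, row 0 is dropped, because id 1234 appears twice). The data is then grouped by condition with groupby('condition'), and the ids of each condition are put into one list with id.apply(list). The final applymap fills the cells that have no list with an empty list (you cannot use fillna([]); that is not possible).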
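To make the per-row step concrete, here is a small sketch that runs it for row 3 only, on the sample data from the question. The variable name step is just for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": [1234, 2323, 3843, 1234, 8574, 9483],
    "condition": ["A", "B", "B", "C", "A", "A"],
})

# Rows 0..3, keeping only the most recent row for each id (row 0 is dropped
# because id 1234 appears again in row 3), then one list of ids per condition
step = (df.loc[:3]
          .drop_duplicates("id", keep="last")
          .groupby("condition")["id"]
          .apply(list))
print(step)
```

Condition A does not appear in the result at all, which is why the full expression needs the extra pass that turns the missing cells into empty lists.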

For the length of each condition's list, you can do the following:

for cond in arr_cond:
    df['len_{}'.format(cond)] = df[cond].str.len().fillna(0).astype(int)

The result looks like this:

     id condition             A             B       C  len_A  len_B  len_C
0  1234         A        [1234]            []      []      1      0      0
1  2323         B        [1234]        [2323]      []      1      1      0
2  3843         B        [1234]  [2323, 3843]      []      1      2      0
3  1234         C            []  [2323, 3843]  [1234]      0      2      1
4  8574         A        [8574]  [2323, 3843]  [1234]      1      2      1
5  9483         A  [8574, 9483]  [2323, 3843]  [1234]      2      2      1
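Since it is unclear whether EDIT 2 improves performance, here is a rough alternative sketch, not the method from the answer: a single pass that remembers the most recent condition of each id in a plain dict and takes a snapshot per row. The helper name running_ids_per_condition is made up for illustration, and nothing was benchmarked, so treat any speed-up as an assumption to verify on real data.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": [1234, 2323, 3843, 1234, 8574, 9483],
    "condition": ["A", "B", "B", "C", "A", "A"],
})

def running_ids_per_condition(frame):
    """Per row, list the ids whose most recent condition is each condition."""
    conditions = list(frame["condition"].unique())
    latest = {}  # id -> most recent condition; dicts preserve insertion order
    rows = []
    for id_, cond in zip(frame["id"], frame["condition"]):
        # drop any earlier occurrence so the id moves to the end of the order
        latest.pop(id_, None)
        latest[id_] = cond
        snapshot = {c: [] for c in conditions}
        for seen_id, seen_cond in latest.items():
            snapshot[seen_cond].append(seen_id)
        rows.append(snapshot)
    return pd.DataFrame(rows, index=frame.index, columns=conditions)

lists = running_ids_per_condition(df)
# length of each list, one len_<condition> column per condition
lengths = lists.apply(lambda col: col.map(len)).add_prefix("len_")
result = pd.concat([df, lists, lengths], axis=1)
print(result)
```

On the sample data this gives the same lists and lengths as the table above. Each row still copies the current state, so the work grows with the number of distinct ids, but it avoids calling drop_duplicates and groupby once per row.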
