How to conditionally sum a Pandas dataframe

Question

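I'm looking for an efficient way (without loops) to add a column to a dataframe that contains the sum of a column of that same dataframe, filtered by some values in the row. Example:

Dataframe: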

ClientID   Date           Orders
123        2020-03-01     23
123        2020-03-05     10
123        2020-03-10     7
456        2020-02-22     3
456        2020-02-25     15
456        2020-02-28     5
...

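I want to add a column "orders_last_week" containing the total number of orders for that specific client in the 7 days before the given date. The Excel equivalent would be something like: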

SUMIFS([orders],[ClientID],ClientID,[Date]>=Date-7,[Date]<Date)

So this would be the result:

ClientID   Date           Orders  Orders_Last_Week
123        2020-03-01     23      0
123        2020-03-05     10      23
123        2020-03-10     7       10
456        2020-02-22     3       0
456        2020-02-25     15      3
456        2020-02-28     5       18
...

I can solve this with a loop, but since my dataframe contains >20M records, that is not a feasible solution. Can anyone help me? Much appreciated!

Answer 1

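I'll assume your dataframe is named df. I'll also assume that dates are not repeated for a given ClientID and are in ascending order (if that is not the case, do a groupby sum and sort the result so that it is).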

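The gist of my solution is, for a given ClientID and Date:

Use groupby.transform to split this problem up by ClientID.
Use rolling to check the 7 preceding rows for dates that fall inside the 1-week timespan.
Within those 7 rows, dates inside the timespan are flagged True (=1). Dates that are not are flagged False (=0).
Within those 7 rows, multiply the Orders column by the True/False flag of each date.
Sum the result.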

Actually, we use 8 rows, because, for example, SuMoTuWeThFrSaSu has 8 days.
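The flag, multiply, and sum steps can be tried on a single window in isolation. This is only an illustrative sketch: the variable names are made up here, and the data is the ClientID 456 rows from the question, with the current row last.

```python
import numpy as np
import pandas as pd

# One hypothetical window: dates and order counts for a single client,
# with the current row in the last position.
dates = pd.to_datetime(["2020-02-22", "2020-02-25", "2020-02-28"])
orders = np.array([3, 15, 5])

current = dates[-1]
# True for rows dated in [current - 7 days, current), False otherwise.
in_previous_week = (dates >= current - pd.Timedelta(days=7)) & (dates < current)

# Multiplying by the boolean flags zeroes out rows outside the timespan.
orders_last_week = int((orders * in_previous_week).sum())
print(orders_last_week)  # 18
```

The current row is excluded by the strict `<` comparison, which mirrors the `[Date]<Date` condition in the Excel formula.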

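What makes this difficult is that rolling aggregates one column at a time, so it apparently doesn't let you work with multiple columns while aggregating. If it did, you could build a filter from the date column and use that to sum the orders.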

There is a loophole, though: you can use multiple columns if you are happy to smuggle them in via the index!
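To see why the index trick works, note that rolling(...).apply with raw=False hands each window to the function as a Series, index included. A small sketch with a made-up helper (span_in_days is a hypothetical name, not part of the solution) that reads only the index:

```python
import pandas as pd

s = pd.Series(
    [23.0, 10.0, 7.0],
    index=pd.to_datetime(["2020-03-01", "2020-03-05", "2020-03-10"]),
)

def span_in_days(window):
    # window is a Series (because raw=False), so the dates in its
    # index are available alongside the values.
    return (window.index[-1] - window.index[0]).days

spans = s.rolling(2, min_periods=1).apply(span_in_days, raw=False)
print(spans.tolist())  # [0.0, 4.0, 5.0]
```

With raw=True the function would receive a bare NumPy array and the dates would be lost.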

I use some helper functions. Note that a is understood to be a pandas Series with up to 8 rows, holding the "Orders" values, with "Date" in the index.

I'm curious to know how this performs on real data.

import pandas as pd

data = {
    'ClientID': {0: 123, 1: 123, 2: 123, 3: 456, 4: 456, 5: 456},
    'Date': {0: '2020-03-01', 1: '2020-03-05', 2: '2020-03-10',
             3: '2020-02-22', 4: '2020-02-25', 5: '2020-02-28'},
    'Orders': {0: 23, 1: 10, 2: 7, 3: 3, 4: 15, 5: 5}
}

df = pd.DataFrame(data)

# Make sure the dates are datetimes
df['Date'] = pd.to_datetime(df['Date'])

# Put into index so we can smuggle them through "rolling"
df = df.set_index(['ClientID', 'Date'])


def date(a):
    # Get the "Date" level of the index for the rows in this window
    return a.index.get_level_values('Date')

def previous_week(a):
    # Boolean flags marking the rows that fall in the week before the
    # date of the last row in a (the current row); True acts as 1, False as 0.
    current = date(a)[-1]
    return (date(a) >= current - pd.DateOffset(days=7)) & (date(a) < current)

def previous_week_order_total(a):
    # Compute the order total for the previous week
    return sum(previous_week(a) * a)

def total_last_week(group):
    # For one "ClientID", compute all the "previous week order totals"
    return group.rolling(8, min_periods=1).apply(previous_week_order_total, raw=False)

# Ok, actually compute this (select "Orders" so transform returns a Series)
df['Orders_Last_Week'] = df.groupby('ClientID')['Orders'].transform(total_last_week)

# Reset the index back so you can have the ClientID and Date columns back
df = df.reset_index()

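The code above relies on the fact that the past week contains at most 7 rows of data, i.e. one for each of the 7 days of the week (although in your example it is actually fewer than 7).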

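If your time window is something other than a week, you need to replace every reference to the length of a week, based on the finest granularity of your timestamps.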

For example, if your timestamps are never spaced less than 1 second apart and you are interested in a 1-minute window (e.g. "Orders_last_minute"), replace pd.DateOffset(days=7) with pd.DateOffset(seconds=60) and group.rolling(8,... with group.rolling(61,...
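One way to avoid editing several places by hand is to pass the window length and the row count as arguments. The following is only a sketch under the same assumptions as above (unique, ascending dates per ClientID); the function name orders_in_previous_window and its parameters offset and max_rows are hypothetical, introduced here for illustration.

```python
import pandas as pd

def orders_in_previous_window(df, offset, max_rows):
    """Per row, sum Orders of the same ClientID dated in [Date - offset, Date).

    max_rows must be at least the largest number of rows that can fall
    inside one window, plus one for the current row.
    """
    def window_total(a):
        dates = a.index.get_level_values("Date")
        in_window = (dates >= dates[-1] - offset) & (dates < dates[-1])
        return (a * in_window).sum()

    indexed = df.set_index(["ClientID", "Date"])
    totals = indexed.groupby("ClientID")["Orders"].transform(
        lambda g: g.rolling(max_rows, min_periods=1).apply(window_total, raw=False)
    )
    # transform keeps the original row order, so positions line up with df
    return totals.to_numpy()

df = pd.DataFrame({
    "ClientID": [123, 123, 123, 456, 456, 456],
    "Date": pd.to_datetime(["2020-03-01", "2020-03-05", "2020-03-10",
                            "2020-02-22", "2020-02-25", "2020-02-28"]),
    "Orders": [23, 10, 7, 3, 15, 5],
})
df["Orders_Last_Week"] = orders_in_previous_window(df, pd.Timedelta(days=7), 8)
print(df)
```

For the one-minute case described above, the call would be orders_in_previous_window(df, pd.Timedelta(seconds=60), 61).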

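Obviously, this code is a bit pessimistic: in this case it always looks at 61 rows for every row. Unfortunately, rolling with an integer window does not adapt the window size to the dates. I suspect that in some cases a python loop that takes advantage of the dataframe being sorted by date might run faster than this partially vectorized solution.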
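For comparison, pandas rolling also accepts a time offset such as "7D" as the window when the index is datetime-like, which sizes each window by date rather than by row count. The sketch below is an alternative to the approach above, not a drop-in replacement. It assumes unique (ClientID, Date) pairs, dates sorted within each client, and a pandas version in which closed="left" is supported for offset windows on a grouped rolling object.

```python
import pandas as pd

df = pd.DataFrame({
    "ClientID": [123, 123, 123, 456, 456, 456],
    "Date": pd.to_datetime(["2020-03-01", "2020-03-05", "2020-03-10",
                            "2020-02-22", "2020-02-25", "2020-02-28"]),
    "Orders": [23, 10, 7, 3, 15, 5],
})

# Dates must be ascending within each client for a time-based window.
df = df.sort_values(["ClientID", "Date"]).reset_index(drop=True)

last_week = (
    df.set_index("Date")
      .groupby("ClientID")["Orders"]
      .rolling("7D", closed="left")   # window is [Date - 7 days, Date)
      .sum()
      .fillna(0)                      # empty windows yield NaN
      .rename("Orders_Last_Week")
)

# last_week is indexed by (ClientID, Date); join it back on those columns.
result = df.join(last_week, on=["ClientID", "Date"])
print(result)
```

Whether this is faster on 20M rows would need to be measured on the real data.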

Answer 1: accepted, 2020-04-23 16:28:29