I'm looking for an efficient way (without looping) to add a column to a dataframe, containing a sum over a column of that same dataframe, filtered by some values in the row. Example:
Dataframe:
ClientID Date Orders
123 2020-03-01 23
123 2020-03-05 10
123 2020-03-10 7
456 2020-02-22 3
456 2020-02-25 15
456 2020-02-28 5
...
I want to add a column "Orders_Last_Week" containing the total number of orders for that specific client in the 7 days before the given date. The Excel equivalent would be something like:
SUMIFS([orders],[ClientID],ClientID,[Date]>=Date-7,[Date]<Date)
So this would be the result:
ClientID Date Orders Orders_Last_Week
123 2020-03-01 23 0
123 2020-03-05 10 23
123 2020-03-10 7 10
456 2020-02-22 3 0
456 2020-02-25 15 3
456 2020-02-28 5 18
...
I can solve this with a loop, but since my dataframe contains >20M records, this is not a feasible solution. Can anyone please help me out? Much appreciated!
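For reference, the loop-based approach mentioned in the question might look like the sketch below. It is only meant to pin down the intended semantics (same client, dates in the half-open range from Date - 7 days up to but excluding Date) on the sample rows; the exact loop the asker used is not shown in the question, so this is an assumption.

```python
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame({
    "ClientID": [123, 123, 123, 456, 456, 456],
    "Date": ["2020-03-01", "2020-03-05", "2020-03-10",
             "2020-02-22", "2020-02-25", "2020-02-28"],
    "Orders": [23, 10, 7, 3, 15, 5],
})
df["Date"] = pd.to_datetime(df["Date"])

totals = []
for row in df.itertuples(index=False):
    # Same client, and Date in [row.Date - 7 days, row.Date).
    mask = (
        (df["ClientID"] == row.ClientID)
        & (df["Date"] >= row.Date - pd.Timedelta(days=7))
        & (df["Date"] < row.Date)
    )
    totals.append(int(df.loc[mask, "Orders"].sum()))

df["Orders_Last_Week"] = totals
print(df)
```

Each iteration scans the whole frame, so the cost grows quadratically with the number of rows, which is why this does not scale to millions of records.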
I'll assume your dataframe is named df. I'll also assume that dates aren't repeated for a given ClientID, and are in ascending order (if this isn't the case, do a groupby sum and sort the result so that it is).
The gist of my solution is, for a given ClientID and Date, to use rolling to check the 7 preceding rows for dates that are within the 1-week timespan. Actually, we use 8 rows (the current row plus the 7 before it), because, e.g., SuMoTuWeThFrSaSu has 8 days.
What makes this hard is that rolling aggregates columns one at a time, and so doesn't obviously allow you to work with multiple columns when aggregating. If it did, you could make a filter using the date column, and use that to sum the orders.
There is a loophole, though: you can use multiple columns if you're happy to smuggle them in via the index!
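A minimal sketch of that loophole, using a hypothetical toy series rather than the real dataframe: when rolling(...).apply is called with raw=False, each window arrives as a Series, so the function can read the dates from the window's index as well as the values.

```python
import pandas as pd

# Hypothetical toy series: order counts indexed by date.
orders = pd.Series(
    [23.0, 10.0, 7.0],
    index=pd.to_datetime(["2020-03-01", "2020-03-05", "2020-03-10"]),
)

def days_spanned(window):
    # With raw=False, `window` is a Series, so its index (the dates) is visible.
    return (window.index[-1] - window.index[0]).days

# Window of 2 rows; the first window holds a single row because min_periods=1.
span = orders.rolling(2, min_periods=1).apply(days_spanned, raw=False)
print(span)
```

Here the function ignores the values entirely and returns the number of days each window covers, which only works because the dates travel along in the index.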
I use some helper functions. Note that a is understood to be a pandas Series with up to 8 rows whose values are "Orders", with "Date" in the index.
Curious to know what performance is like on your real data.
import pandas as pd

data = {
    'ClientID': {0: 123, 1: 123, 2: 123, 3: 456, 4: 456, 5: 456},
    'Date': {0: '2020-03-01', 1: '2020-03-05', 2: '2020-03-10',
             3: '2020-02-22', 4: '2020-02-25', 5: '2020-02-28'},
    'Orders': {0: 23, 1: 10, 2: 7, 3: 3, 4: 15, 5: 5}
}
df = pd.DataFrame(data)

# Make sure the dates are datetimes
df['Date'] = pd.to_datetime(df['Date'])

# Put into index so we can smuggle them through "rolling"
df = df.set_index(['ClientID', 'Date'])

def date(a):
    # get the "Date" index-column from the dataframe
    return a.index.get_level_values('Date')

def previous_week(a):
    # get a column of 0s and 1s identifying the previous week,
    # (compared to the date in the last row in a).
    return (date(a) >= date(a)[-1] - pd.DateOffset(days=7)) * (date(a) < date(a)[-1])

def previous_week_order_total(a):
    # compute the order total for the previous week
    return sum(previous_week(a) * a)

def total_last_week(group):
    # for a "ClientID" compute all the "previous week order totals"
    return group.rolling(8, min_periods=1).apply(previous_week_order_total, raw=False)

# Ok, actually compute this
df['Orders_Last_Week'] = df.groupby(['ClientID']).transform(total_last_week)

# Reset the index back so you can have the ClientID and Date columns back
df = df.reset_index()
The above code relies upon the fact that the past week encompasses at most 7 rows of data, i.e., the 7 days in a week (although in your example, it is actually fewer than 7).
If your time window is something other than a week, you'll need to replace all the references to the length of a week, expressed in terms of the finest division of your timestamps.
For example, if your date timestamps are spaced no closer than 1 second, and you are interested in a time window of 1 minute (e.g., "Orders_last_minute"), replace pd.DateOffset(days=7) with pd.DateOffset(seconds=60), and group.rolling(8, ...) with group.rolling(61, ...).
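As a rough sketch of that substitution, the two numbers can be pulled out into parameters. The helper name trailing_total, its arguments, and the toy events frame are all hypothetical, and pd.Timedelta is used here in place of pd.DateOffset for the fixed-length offset.

```python
import pandas as pd

def trailing_total(df, offset, max_rows):
    # Hypothetical helper: per-client sum of Orders over [Date - offset, Date).
    # max_rows is the row window: the most rows a time window can hold, plus one.
    orders = df.set_index(["ClientID", "Date"])["Orders"]

    def window_total(a):
        dates = a.index.get_level_values("Date")
        in_window = (dates >= dates[-1] - offset) & (dates < dates[-1])
        return a[in_window].sum()

    def per_client(group):
        return group.rolling(max_rows, min_periods=1).apply(window_total, raw=False)

    return orders.groupby(level="ClientID").transform(per_client).to_numpy()

# Toy data with second-level timestamps (made up for illustration).
events = pd.DataFrame({
    "ClientID": [1, 1, 1, 1],
    "Date": pd.to_datetime([
        "2020-03-01 12:00:00", "2020-03-01 12:00:30",
        "2020-03-01 12:01:00", "2020-03-01 12:01:15",
    ]),
    "Orders": [2, 3, 4, 5],
})
events["Orders_Last_Minute"] = trailing_total(events, pd.Timedelta(seconds=60), 61)
print(events)
```

The same assumptions as before apply: dates are unique and ascending within each ClientID.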
Obviously, this code is a bit pessimistic: for each row, it always looks at 61 rows, in this case. Unfortunately, as far as I can tell, rolling does not offer a suitable variable window size when the dates are smuggled in through a MultiIndex like this. I suspect that in some cases a Python loop that takes advantage of the fact that the dataframe is sorted by date might run faster than this partly-vectorized solution.
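One possible variant, which is a different technique from the index-smuggling trick above: pandas rolling accepts a time offset such as "7D" as the window when the index is a plain DatetimeIndex, and closed="left" excludes the current row's own date. The sketch below assumes the rows are sorted by ClientID and then Date, and that dates are unique per client. I have not benchmarked it on 20M rows.

```python
import pandas as pd

df = pd.DataFrame({
    "ClientID": [123, 123, 123, 456, 456, 456],
    "Date": pd.to_datetime(["2020-03-01", "2020-03-05", "2020-03-10",
                            "2020-02-22", "2020-02-25", "2020-02-28"]),
    "Orders": [23, 10, 7, 3, 15, 5],
})

# Assumption: sort by client, then date, so the result lines up positionally.
df = df.sort_values(["ClientID", "Date"]).reset_index(drop=True)

last_week = (
    df.set_index("Date")
      .groupby("ClientID")["Orders"]
      .rolling("7D", closed="left")  # window is [Date - 7 days, Date)
      .sum()
)

# Empty windows come back as NaN; treat them as zero orders.
df["Orders_Last_Week"] = last_week.fillna(0).round().astype(int).to_numpy()
print(df)
```

Because the window is defined by time rather than by row count, there is no need to guess how many rows a week can contain.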