
How to sum over a Pandas dataframe conditionally

I'm looking for an efficient way (without looping) to add a column to a dataframe, containing a sum over a column of that same dataframe, filtered by some values in the row. Example:

Dataframe:

ClientID   Date           Orders
123        2020-03-01     23
123        2020-03-05     10
123        2020-03-10     7
456        2020-02-22     3
456        2020-02-25     15
456        2020-02-28     5
...

I want to add a column "orders_last_week" containing the total number of orders for that specific client in the 7 days before the given date. The Excel equivalent would be something like:

SUMIFS([orders],[ClientID],ClientID,[Date]>=Date-7,[Date]<Date)

So this would be the result:

ClientID   Date           Orders  Orders_Last_Week
123        2020-03-01     23      0
123        2020-03-05     10      23
123        2020-03-10     7       10
456        2020-02-22     3       0
456        2020-02-25     15      3
456        2020-02-28     5       18
...

I can solve this with a loop (a naive sketch follows below), but since my dataframe contains >20M records, that is not a feasible solution. Can anyone please help me out? Much appreciated!
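For reference, a naive per-row version of that loop might look like the sketch below (hypothetical helper name; it assumes Date is already a datetime column). It pins down the SUMIFS semantics, but it is O(n²) and hopeless at 20M rows:

import pandas as pd

def orders_last_week_naive(df):
    # For each row, sum this client's orders dated within the 7 days
    # strictly before that row's date.
    out = []
    for _, row in df.iterrows():
        mask = (
            (df['ClientID'] == row['ClientID'])
            & (df['Date'] >= row['Date'] - pd.Timedelta(days=7))
            & (df['Date'] < row['Date'])
        )
        out.append(df.loc[mask, 'Orders'].sum())
    return out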

I'll assume your dataframe is named df. I'll also assume that dates aren't repeated for a given ClientID, and are in ascending order (if this isn't the case, do a groupby sum and sort the result so that it is).

The gist of my solution, for a given ClientID and Date, is:

  1. Use groupby.transform to split this problem up by ClientID.
  2. Use rolling to check the 7 preceding rows for dates that are within the 1-week timespan.
  3. In those 7 rows, dates within the timespan are labelled True (=1). Dates that are not are labelled False (=0).
  4. In those 7 rows, multiply the Orders column by the True/False labelling of dates.
  5. Sum the result.

Actually, we use a window of 8 rows (the current row plus the 7 before it), because, e.g., Su Mo Tu We Th Fr Sa Su spans 8 days.

What makes this hard is that rolling aggregates columns one at a time, and so doesn't obviously allow you to work with multiple columns when aggregating. If it did, you could make a filter using the date column, and use that to sum the orders.

There is a loophole, though: you can use multiple columns if you're happy to smuggle them in via the index!

I use some helper functions. Note that a is understood to be a pandas Series of (up to) 8 rows holding the Orders values, with Date available in the index.

Curious to know what performance is like on your real data.

import pandas as pd

data = {
    'ClientID': {0: 123, 1: 123, 2: 123, 3: 456, 4: 456, 5: 456},
    'Date': {0: '2020-03-01', 1: '2020-03-05', 2: '2020-03-10',
             3: '2020-02-22', 4: '2020-02-25', 5: '2020-02-28'},
    'Orders': {0: 23, 1: 10, 2: 7, 3: 3, 4: 15, 5: 5}
}

df = pd.DataFrame(data)

# Make sure the dates are datetimes
df['Date'] = pd.to_datetime(df['Date'])

# Put into index so we can smuggle them through "rolling"
df = df.set_index(['ClientID', 'Date'])


def date(a):
    # get the "Date" index-column from the dataframe 
    return a.index.get_level_values('Date')

def previous_week(a):
    # get a column of 0s and 1s identifying the previous week, 
    # (compared to the date in the last row in a).
    return (date(a) >= date(a)[-1] - pd.DateOffset(days=7)) * (date(a) < date(a)[-1]) 

def previous_week_order_total(a):
    # compute the order total for the previous week
    return sum(previous_week(a) * a)

def total_last_week(group):
    # for a "ClientID" compute all the "previous week order totals"
    return group.rolling(8, min_periods=1).apply(previous_week_order_total, raw=False)

# Ok, actually compute this
df['Orders_Last_Week'] = df.groupby(['ClientID']).transform(total_last_week)

# Reset the index back so you can have the ClientID and Date columns back
df = df.reset_index()
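As a quick check, on the sample data this reproduces the expected values (as floats, since rolling.apply returns floats):

print(df['Orders_Last_Week'].tolist())
# [0.0, 23.0, 10.0, 0.0, 3.0, 18.0]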

The above code relies upon the fact that the past week encompasses at most 7 rows of data, i.e., the 7 days in a week (although in your example it is actually fewer than 7).

If your time window is something other than a week, you'll need to replace all the references to the length of a week, expressed in terms of the finest spacing of your timestamps.

For example, if your timestamps are spaced no closer than 1 second apart and you are interested in a time window of 1 minute (e.g., "Orders_Last_Minute"), replace pd.DateOffset(days=7) with pd.DateOffset(seconds=60), and group.rolling(8, ...) with group.rolling(61, ...), as sketched below.
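A minimal sketch of that substitution (hypothetical names, reusing the date helper above):

def previous_minute(a):
    # 0/1 labels for dates in the minute before the last row's date
    return (date(a) >= date(a)[-1] - pd.DateOffset(seconds=60)) * (date(a) < date(a)[-1])

def previous_minute_order_total(a):
    return sum(previous_minute(a) * a)

def total_last_minute(group):
    # 61 rows: the current row plus at most 60 rows in the preceding minute
    return group.rolling(61, min_periods=1).apply(previous_minute_order_total, raw=False)

# with ClientID and Date in the index, as in the main example above
df['Orders_Last_Minute'] = df.groupby(['ClientID']).transform(total_last_minute)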

Obviously, this code is a bit pessimistic: for each row it always inspects the full window (61 rows in that example), even when far fewer rows fall inside the time span. An integer rolling window can't adapt its size to the data, so I suspect that in some cases a Python loop that takes advantage of the fact that the dataframe is sorted by date might run faster than this partly-vectorized solution.
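That said, if your pandas version supports time-based windows (rolling accepts an offset string such as '7D' on a datetime index, and closed='left' excludes the current row), the whole computation collapses to a few lines. A hedged sketch, assuming dates are sorted within each client:

df = df.sort_values(['ClientID', 'Date'])

result = (
    df.set_index('Date')
      .groupby('ClientID')['Orders']
      .rolling('7D', closed='left')   # window is [date - 7 days, date)
      .sum()
)

# the first row of each client has an empty window, hence NaN -> 0
df['Orders_Last_Week'] = result.fillna(0).to_numpy()

On the sample data this produces the same Orders_Last_Week column as above.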
