pandas dataframe: calculate rolling mean using a customized window size

I'm trying to calculate the rolling mean/std for a column in a dataframe. The pandas and numpy_ext rolling methods seem to need a fixed window size. The dataframe has a column "dates", and I want to decide the window based on it. For example, when calculating the mean/std for rows at day 10, include rows from day 2 to day 6; for rows at day 11, include rows from day 3 to day 7; for rows at day 12, include rows from day 4 to day 8; and so on.

Is there a way to do this other than brute-force coding? Sample data is below; "quantity" is the target field for the mean and std.

dates  material  location  quantity
1      C         A         870
2      D         A         920
3      C         A         120
4      D         A         120
6      C         A         120
8      D         A         1200
8      C         A         720
10     D         A         480
11     D         A         600
12     C         A         720
13     D         A         80
13     D         A         600
14     D         A         1200
18     E         B         150
18 E B 150

For example, for each row I want the rolling mean of "quantity" over the previous 3-8 days (if any), that is, over the rows dated from 8 days to 4 days before the current row, as in the day 10 example above. The expected output is:

| dates | material | location | quantity | Mean                                 |
|-------|----------|----------|----------|--------------------------------------|
| 1     | C        | A        | 870      | NaN                                  |
| 2     | D        | A        | 920      | NaN                                  |
| 3     | C        | A        | 120      | NaN                                  |
| 4     | D        | A        | 120      | NaN                                  |
| 6     | C        | A        | 120      | (870+920)/2 = 895                    |
| 8     | D        | A        | 1200     | (870+920+120+120)/4 = 507.5          |
| 8     | C        | A        | 720      | (870+920+120+120)/4 = 507.5          |
| 10    | D        | A        | 480      | (920+120+120+120)/4 = 320            |
| 11    | D        | A        | 600      | (120+120+120)/3 = 120                |
| 12    | C        | A        | 720      | (120+120+1200+720)/4 = 540           |
| 13    | D        | A        | 80       | (120+1200+720)/3 = 680               |
| 13    | D        | A        | 600      | (120+1200+720)/3 = 680               |
| 14    | D        | A        | 1200     | (120+1200+720+480)/4 = 630           |
| 18    | E        | B        | 150      | (480+600+720+80+600+1200)/6 = 613.33 |

A follow-up question:

Is there a way to further filter the window by other columns? For example, when calculating the rolling mean of "quantity" over the previous 3-8 days, the rows in the window must also have the same "material" as the current row. The new expected output would be:

| dates | material | location | quantity | Mean                 |
|-------|----------|----------|----------|----------------------|
| 1     | C        | A        | 870      | NaN                  |
| 2     | D        | A        | 920      | NaN                  |
| 3     | C        | A        | 120      | NaN                  |
| 4     | D        | A        | 120      | NaN                  |
| 6     | C        | A        | 120      | (870)/1 = 870        |
| 8     | D        | A        | 1200     | (920+120)/2 = 520    |
| 8     | C        | A        | 720      | (870+120)/2 = 495    |
| 10    | D        | A        | 480      | (920+120)/2 = 520    |
| 11    | D        | A        | 600      | (120)/1 = 120        |
| 12    | C        | A        | 720      | (120+720)/2 = 420    |
| 13    | D        | A        | 80       | (1200)/1 = 1200      |
| 13    | D        | A        | 600      | (1200)/1 = 1200      |
| 14    | D        | A        | 1200     | (1200+480)/2 = 840   |
| 18    | E        | B        | 150      | NaN                  |

The DataFrame constructor for anyone else to try:

import pandas as pd

# Note: the 'material' values below differ from the question's sample data;
# this has no effect on the unfiltered mean/std.
d = {'dates': [1, 2, 3, 4, 6, 8, 8, 10, 11, 12, 13, 13, 14, 18],
     'material': ['C','C','C','C','C','D','D','D','D','D','D','D','D','E'],
     'location':['A','A','A','A','A','A','A','A','A','A','A','A','A','B'],
     'quantity': [870, 920, 120, 120, 120, 1200, 720, 480, 600, 720, 80, 600, 1200, 150]}
df = pd.DataFrame(d)

df.rolling does accept "a time period of each window. Each window will be a variable sized based on the observations included in the time-period. This is only valid for datetimelike indexes."

So we would have to convert your days to a datetimelike type (e.g., a pd.Timestamp or a pd.Timedelta) and set it as the index.

But this method cannot perform the shift that you want (e.g., for day 14 you want the window to end not at day 14 but at day 10, 4 days before it).
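
To illustrate that limitation, here is a minimal sketch of the time-based window. The origin date `2000-01-01` and the names `stamps`, `ts` and `trailing_mean` are arbitrary choices made for this example, not part of the question.

```python
import pandas as pd

df = pd.DataFrame({
    'dates': [1, 2, 3, 4, 6, 8, 8, 10, 11, 12, 13, 13, 14, 18],
    'quantity': [870, 920, 120, 120, 120, 1200, 720,
                 480, 600, 720, 80, 600, 1200, 150],
})

# Turn the integer day numbers into timestamps (the origin date is arbitrary).
stamps = pd.Timestamp('2000-01-01') + pd.to_timedelta(df['dates'], unit='D')
ts = df.set_index(stamps)['quantity']

# A 5-day window that ends at, and includes, the current row's day.
trailing_mean = ts.rolling('5D').mean()
print(trailing_mean)
```

The row for day 6 gets 320, the mean over days 2 to 6. That is the value the question expects for day 10, so the window has the right length but is not shifted back by 4 days.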


So there is another option, which df.rolling also accepts:

Use a BaseIndexer subclass

There is very little documentation on it and I'm not an expert, but I was able to hack my solution into it. Surely, there must be a better (proper) way to use all its attributes correctly, and hopefully someone will show it in their answer here.

How I did it:

Inside our BaseIndexer subclass, we have to define the get_window_bounds method that returns a tuple of ndarrays: index positions of the starts of all windows, and those of the ends of all windows respectively (index positions like the ones that can be used in iloc - not with loc ).

To find them, I used the most efficient method from this answer : np.searchsorted .

Your 'dates' must be sorted for this.
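
As a small standalone sketch of that lookup, using the sample dates and a window from 8 to 4 days back (the names `starts` and `ends` are only for illustration):

```python
import numpy as np

days = np.array([1, 2, 3, 4, 6, 8, 8, 10, 11, 12, 13, 13, 14, 18])

# First position whose day is >= day - 8 (inclusive left edge).
starts = np.searchsorted(days, days - 8, side='left')
# One past the last position whose day is <= day - 4 (inclusive right edge).
ends = np.searchsorted(days, days - 4, side='right')

print(starts)  # [0 0 0 0 0 0 0 1 2 3 4 4 4 7]
print(ends)    # [0 0 0 0 2 4 4 5 5 7 7 7 8 13]
```

For the last row (day 18) the bounds are 7 and 13, so `days[7:13]` covers days 10 to 14. Rows where `starts` equals `ends` have an empty window.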

Any keyword arguments that we pass to the BaseIndexer subclass constructor will be set as its attributes. I will set day_from , day_to and days :

import numpy as np
from pandas.api.indexers import BaseIndexer

class CustomWindow(BaseIndexer):
    """
    Indexer that selects the dates.

    It uses the arguments:
    ----------------------
    day_from : int
    day_to : int
    days : np.ndarray
    """
    def get_window_bounds(self,
                          num_values: int = 0,
                          min_periods: int | None = None,
                          center: bool | None = None,
                          closed: str | None = None,
                          step: int | None = None) -> tuple[np.ndarray, np.ndarray]:
        """
        I'm not using these arguments, but they must be present because
        pandas passes them as keyword arguments:
            `num_values` is the length of the df,
            `center`: False, `closed`: None.
        Newer pandas versions (1.5+) also pass `step`.
        """
        days = self.days
        # With `side` I'm making both ends inclusive:
        window_starts = np.searchsorted(days, days + self.day_from, side='left')
        window_ends = np.searchsorted(days, days + self.day_to, side='right')
        return (window_starts, window_ends)

# In my implementation both ends are inclusive:
day_from = -8
day_to = -4
days = df['dates'].to_numpy()

my_indexer = CustomWindow(day_from=day_from, day_to=day_to, days=days)
df[['mean', 'std']] = (df['quantity']
                       .rolling(my_indexer, min_periods=0)
                       .agg(['mean', 'std']))

Result:

    dates material location  quantity        mean         std
0       1        C        A       870         NaN         NaN
1       2        C        A       920         NaN         NaN
2       3        C        A       120         NaN         NaN
3       4        C        A       120         NaN         NaN
4       6        C        A       120  895.000000   35.355339
5       8        D        A      1200  507.500000  447.911822
6       8        D        A       720  507.500000  447.911822
7      10        D        A       480  320.000000  400.000000
8      11        D        A       600  120.000000    0.000011
9      12        D        A       720  540.000000  523.067873
10     13        D        A        80  680.000000  541.109970
11     13        D        A       600  680.000000  541.109970
12     14        D        A      1200  630.000000  452.990066
13     18        E        B       150  613.333333  362.803896

You can perform your operation with a regular rolling window, but you have to pre- and post-process the DataFrame a bit to generate the shift:

A = 3
B = 8
s = (df
     # de-duplicate by getting the sum/count per identical date
     .groupby('dates')['quantity']
     .agg(['sum', 'count'])
     # reindex to fill missing dates
     .reindex(range(df['dates'].min(),
                    df['dates'].max() + 1),
              fill_value=0)
     # compute classical rolling sums over a window of B-A days
     .rolling(B - A, min_periods=1).sum()
     # compute mean
     .assign(mean=lambda d: d['sum'] / d['count'])
     # shift so that the window ends A+1 days before the current date
     ['mean'].shift(A + 1)
     )

df['Mean'] = df['dates'].map(s)

output:

    dates material location  quantity        Mean
0       1        C        A       870         NaN
1       2        C        A       920         NaN
2       3        C        A       120         NaN
3       4        C        A       120         NaN
4       6        C        A       120  895.000000
5       8        D        A      1200  507.500000
6       8        D        A       720  507.500000
7      10        D        A       480  320.000000
8      11        D        A       600  120.000000
9      12        D        A       720  540.000000
10     13        D        A        80  680.000000
11     13        D        A       600  680.000000
12     14        D        A      1200  630.000000
13     18        E        B       150  613.333333
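
The window arithmetic in the snippet above (a window of `B-A` days, shifted by `A+1`) can be checked with a toy series whose value at each day is the day number itself. This is only a sanity-check sketch; the names `day`, `first` and `last` are made up for it.

```python
import pandas as pd

A = 3
B = 8

# One entry per calendar day; the value is simply the day number.
idx = pd.RangeIndex(1, 19)
day = pd.Series(idx, index=idx, dtype=float)

# Earliest and latest day covered by the rolled-then-shifted window.
first = day.rolling(B - A, min_periods=1).min().shift(A + 1)
last = day.rolling(B - A, min_periods=1).max().shift(A + 1)

print(first[10], last[10])  # 2.0 6.0
print(first[18], last[18])  # 10.0 14.0
```

So the value mapped to day 10 is built from days 2 to 6, and the value for day 18 from days 10 to 14, which matches the windows in the question.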

Another possible solution:

def f(x):
  return np.arange(np.amax([0, x-8]), np.amax([0, x-3]))

df['Mean'] = df.dates.map(lambda x:  df.quantity[df.dates.isin(f(x))].mean())

Output (rows 14-16 are extra rows that are not in the sample data above):

    dates material location  quantity        Mean
0       1        C        A       870         NaN
1       2        C        A       920         NaN
2       3        C        A       120         NaN
3       4        C        A       120         NaN
4       6        C        A       120  895.000000
5       8        D        A      1200  507.500000
6       8        D        A       720  507.500000
7      10        D        A       480  320.000000
8      11        D        A       600  120.000000
9      12        D        A       720  540.000000
10     13        D        A        80  680.000000
11     13        D        A       600  680.000000
12     14        D        A      1200  630.000000
13     18        E        B       150  613.333333
14     19        E        B      1416  640.000000
15     20        F        B      1164  650.000000
16     21        G        B     11520  626.666667
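
To see what the helper `f` selects, it can be called on a few day values. This small check assumes, as in the sample data, that day numbers are positive integers:

```python
import numpy as np

def f(x):
    # Days from x-8 to x-4 inclusive, clipped at 0.
    return np.arange(np.amax([0, x - 8]), np.amax([0, x - 3]))

print(f(10))  # [2 3 4 5 6]
print(f(6))   # [0 1 2]
print(f(3))   # []
```

An empty array means no earlier day qualifies, so the mean over the selected rows is NaN.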

Inspired by @PaulS's answer, here is a simple way to select based on conditions from multiple columns:

def get_selection(row):
    dates_mask = (df['dates'] < row['dates'] - 3) & (df['dates'] >= row['dates'] - 8)
    material_mask = df['material'] == row['material']
    return df[dates_mask & material_mask]

df['Mean'] = df.apply(lambda row: get_selection(row)['quantity'].mean(), 
                      axis=1)
df['Std'] = df.apply(lambda row: get_selection(row)['quantity'].std(), 
                     axis=1)

Output:

    dates material location  quantity    Mean         Std
0       1        C        A       870     NaN         NaN
1       2        D        A       920     NaN         NaN
2       3        C        A       120     NaN         NaN
3       4        D        A       120     NaN         NaN
4       6        C        A       120   870.0         NaN
5       8        D        A      1200   520.0  565.685425
6       8        C        A       720   495.0  530.330086
7      10        D        A       480   520.0  565.685425
8      11        D        A       600   120.0         NaN
9      12        C        A       720   420.0  424.264069
10     13        D        A        80  1200.0         NaN
11     13        D        A       600  1200.0         NaN
12     14        D        A      1200   840.0  509.116882
13     18        E        B       150     NaN         NaN
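
For anyone who wants to reproduce this, here is a self-contained sketch of the same row-wise selection. It assumes the "material" column from the question's sample data (the DataFrame constructor posted earlier uses different material values) and treats the lowercase "c" there as a typo for "C". The helper name `window_stats` and its `day_from` / `day_to` parameters are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'dates':    [1, 2, 3, 4, 6, 8, 8, 10, 11, 12, 13, 13, 14, 18],
    # Assumed from the question's table, not from the posted constructor.
    'material': ['C', 'D', 'C', 'D', 'C', 'D', 'C',
                 'D', 'D', 'C', 'D', 'D', 'D', 'E'],
    'location': ['A'] * 13 + ['B'],
    'quantity': [870, 920, 120, 120, 120, 1200, 720,
                 480, 600, 720, 80, 600, 1200, 150],
})

def window_stats(row, day_from=8, day_to=4):
    """Mean/std of quantity for same-material rows dated 8 to 4 days back."""
    in_window = df['dates'].between(row['dates'] - day_from,
                                    row['dates'] - day_to)
    same_material = df['material'] == row['material']
    q = df.loc[in_window & same_material, 'quantity']
    return pd.Series({'Mean': q.mean(), 'Std': q.std()})

# One pass over the rows computes both statistics.
stats = df.apply(window_stats, axis=1)
print(df.join(stats))
```

This still scans the whole frame once per row, so it is quadratic in the number of rows. That is fine for small data but may be slow on large frames.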
