pandas - Resample a dataframe using a specified start date, end date and granularity

I want to resample a datetime-indexed dataframe using a start date, an end date and a 'granularity'.

Say I have this dataframe:

                   value
00:00, 01/05/2017    2
12:00, 01/05/2017    4
00:00, 02/05/2017    6
12:00, 02/05/2017    8
00:00, 03/05/2017   10
12:00, 03/05/2017   12

And I want to resample it to go from 06:00, 01/05/2017 to 18:00, 02/05/2017 with a 'granularity' of 12 hours (this is the same as the original here for simplicity, but it doesn't have to be). The result I want is:

                   value
06:00, 01/05/2017    3
18:00, 01/05/2017    5
06:00, 02/05/2017    7
18:00, 02/05/2017    9

Note that each value is the mean of the original values it overlaps (e.g. 3 = mean(2, 4)).
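One way to express this target output in pandas, as a minimal sketch rather than a definitive answer. The function name resample_interpolated is hypothetical, the dates are assumed to be day/month/year, and the "mean of the overlapping values" is approximated here by time-weighted linear interpolation, which equals the plain mean only when a target timestamp falls exactly halfway between two measurements (as it does in this example).

```python
import pandas as pd

# Example frame from the question (dates read as day/month/year).
index = pd.date_range("2017-05-01 00:00", periods=6, freq=pd.Timedelta(hours=12))
df = pd.DataFrame({"value": [2, 4, 6, 8, 10, 12]}, index=index)


def resample_interpolated(df, start, end, granularity):
    """Evaluate df on a regular grid from start to end using time interpolation."""
    target = pd.date_range(start, end, freq=granularity)
    # Add the target timestamps to the original index, interpolate the gaps,
    # then keep only the target timestamps.
    combined = df.reindex(df.index.union(target))
    return combined.interpolate(method="time").reindex(target)


result = resample_interpolated(
    df,
    start=pd.Timestamp("2017-05-01 06:00"),
    end=pd.Timestamp("2017-05-02 18:00"),
    granularity=pd.Timedelta(hours=12),
)
print(result)
```

With the example data this yields 3, 5, 7 and 9 at the four requested timestamps. Target timestamps that lie outside the range of the original index are not handled specially in this sketch.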

I'm unsure how to do this.

My first attempt was:

def resample(df: DataFrame, start: datetime, end: datetime, granularity: timedelta) -> DataFrame:
    result = df.resample(granularity).mean()
    result = result[result.index <= end]
    result = result[result.index >= start]
    return result

This trims the dataframe appropriately and ensures the correct granularity, but it doesn't align the results with the start date, so the result is:

                   value
12:00, 01/05/2017    4
00:00, 02/05/2017    6
12:00, 02/05/2017    8

My second attempt used the base parameter to shift the data:

def resample(df: DataFrame, start: datetime, end: datetime, desired_granularity: timedelta) -> DataFrame:
    data_before_start = df[df.index <= start]
    # Get the last index value before our start date
    last_date_before_start = data_before_start.last_valid_index()
    current_granularity_secs = seconds_between_measurements(df)
    rule = str(int(desired_granularity.total_seconds())) + 'S'
    base = current_granularity_secs - (start - last_date_before_start).total_seconds()
    result = df.resample(rule, base=base).mean()
    result = result[result.index <= end]
    result = result[result.index >= start]
    return result

This gives me:

                   value
06:00, 01/05/2017    4
18:00, 01/05/2017    6
06:00, 02/05/2017    8
18:00, 02/05/2017    10

This has the right indices but the values are backfilled from the next measurement rather than averaged from the measurements before and after.
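The backfilled look follows from how resample bins the data: each bin is labelled by its left edge and averages only the rows that fall inside it, so the 06:00 bin contains just the 12:00 measurement. The sketch below reproduces that behaviour on the example data. It uses the origin argument instead of base, since base was deprecated in pandas 1.1 and removed in 2.0; the variable names are illustrative and dates are assumed to be day/month/year.

```python
import pandas as pd

# Example frame from the question (dates read as day/month/year).
index = pd.date_range("2017-05-01 00:00", periods=6, freq=pd.Timedelta(hours=12))
df = pd.DataFrame({"value": [2, 4, 6, 8, 10, 12]}, index=index)

start = pd.Timestamp("2017-05-01 06:00")
end = pd.Timestamp("2017-05-02 18:00")

# origin=start anchors the bin edges at the start date (pandas >= 1.1).
# Each bin covers [label, label + 12h) and is labelled by its left edge.
binned = df.resample(pd.Timedelta(hours=12), origin=start).mean()
binned = binned[(binned.index >= start) & (binned.index <= end)]
print(binned)
```

The four remaining bins hold 4, 6, 8 and 10: the indices are aligned to the start date, but every bin sees only the single measurement that follows its label.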

Does anyone have any ideas on how I can achieve what I want?

Thanks in advance for your help and just let me know if I've left out any crucial details :)

EDIT: If getting the mean is the bit that makes this very tricky, I could settle for using the value before the given time, similar to pad(). My current 'best' solution gives me the value after, like backfill().
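For the pad()-like fallback, a minimal sketch, assuming the original index is sorted and dates are day/month/year: build the target grid explicitly and reindex onto it with forward filling, so each target timestamp takes the last measurement at or before it. The variable names are illustrative.

```python
import pandas as pd

# Example frame from the question (dates read as day/month/year).
index = pd.date_range("2017-05-01 00:00", periods=6, freq=pd.Timedelta(hours=12))
df = pd.DataFrame({"value": [2, 4, 6, 8, 10, 12]}, index=index)

# Regular grid from the start date to the end date at the desired granularity.
target = pd.date_range(
    "2017-05-01 06:00", "2017-05-02 18:00", freq=pd.Timedelta(hours=12)
)

# method="ffill" picks the last original row at or before each target timestamp.
padded = df.reindex(target, method="ffill")
print(padded)
```

On the example data this gives 2, 4, 6 and 8, i.e. the value before each requested time rather than the value after it.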

First convert your start_date and end_date columns to datetime. Then you can use .resample twice:

  • On df.start_date, with forward filling
  • On df.end_date, with backward filling

Then:

  • Concatenate the two results
  • Keep the rows where start_date < end_date
  • Apply a function to each row to update start_date and end_date

Here is the code:

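import pandas as pd

# freq is the target frequency (e.g. "12h") and is assumed to be defined already.
df[["start_date", "end_date"]] = df[["start_date", "end_date"]].apply(pd.to_datetime)
# ffill() replaces the older pad() alias, which recent pandas versions no longer provide.
df1 = df.set_index("start_date").resample(freq).ffill().reset_index()
df2 = df.set_index("end_date").resample(freq).bfill().reset_index()
df3 = pd.concat([df1, df2], ignore_index=True)

def function(x, df1):
    # Rows that came from df1 get a new end_date; rows from df2 get a new start_date.
    if x.name < df1.shape[0]:
        x.end_date = x.start_date + pd.Timedelta(freq)
    else:
        x.start_date = x.end_date - pd.Timedelta(freq)
    return x

df3[df3.start_date < df3.end_date].apply(lambda x: function(x, df1), axis=1)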

The pandas documentation says it should also be possible to resample on a column directly:

df.resample(freq, on='start_date')
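As a small illustration of the on argument, here is a sketch with a hypothetical frame whose start_date column mirrors the name used above; the data and the one-day frequency are assumptions made only for this example.

```python
import pandas as pd

# Hypothetical frame: a datetime column plus one value column.
df = pd.DataFrame(
    {
        "start_date": pd.to_datetime(
            [
                "2017-05-01 00:00",
                "2017-05-01 12:00",
                "2017-05-02 00:00",
                "2017-05-02 12:00",
            ]
        ),
        "value": [2, 4, 6, 8],
    }
)

freq = pd.Timedelta(days=1)

# on="start_date" resamples by that column without calling set_index first;
# the column becomes the index of the result.
daily = df.resample(freq, on="start_date").mean()
print(daily)
```

Here each daily bin averages the two rows that fall on that day, giving 3 for 1 May and 7 for 2 May.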
