Pandas DataFrame resampling based on column criteria

I want to resample a dataframe based on whether the cell in another column matches my criteria.

df = pd.DataFrame({
        'timestamp': [
            '2013-03-01 08:01:00', '2013-03-01 08:02:00',
            '2013-03-01 08:03:00', '2013-03-01 08:04:00',
            '2013-03-01 08:05:00', '2013-03-01 08:06:00'
        ],
        'Kind': [
            'A', 'B', 'A', 'B', 'A', 'B'
        ],
        'Values': [1, 1.5, 2, 3, 5, 3]
    })

For every timestamp I may have 2-10 kinds, and I want to resample correctly without producing NaN. Currently I resample the entire dataframe using the code below and get NaNs. I think this is because I have multiple entries for certain timestamps.

df.set_index('timestamp').resample('5Min').mean()

One method is to create a separate dataframe for every kind, resample each one, and join the resulting dataframes. I'd like to find out if there's a simpler way of doing it.
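For comparison, a minimal sketch of that split-and-join approach. It assumes the timestamps have already been parsed with pd.to_datetime and uses pandas' default 5-minute bins; the names per_kind and wide are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2013-03-01 08:01:00", "2013-03-01 08:02:00",
        "2013-03-01 08:03:00", "2013-03-01 08:04:00",
        "2013-03-01 08:05:00", "2013-03-01 08:06:00",
    ]),
    "Kind": ["A", "B", "A", "B", "A", "B"],
    "Values": [1, 1.5, 2, 3, 5, 3],
}).set_index("timestamp")

# Resample each kind on its own (hypothetical helper dict, one Series per kind).
per_kind = {
    kind: part["Values"].resample("5min").mean()
    for kind, part in df.groupby("Kind")
}

# Join the per-kind results side by side: one column per kind.
wide = pd.concat(per_kind, axis=1)
print(wide)
```

Each kind is averaged only over its own rows, so one kind's values never dilute another's. A NaN can still appear where a kind has no observations in a bin that another kind does have.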

After defining your dataframe as you stated, first convert the timestamp column to datetime. Then set it as the index, and finally resample and take the mean as follows:

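import pandas as pd
df = pd.DataFrame({
        'timestamp': [
            '2013-03-01 08:01:00', '2013-03-01 08:02:00',
            '2013-03-01 08:03:00', '2013-03-01 08:04:00',
            '2013-03-01 08:05:00', '2013-03-01 08:06:00'
        ],
        'Kind': [
            'A', 'B', 'A', 'B', 'A', 'B'
        ],
        'Values': [1, 1.5, 2, 3, 5, 3]
    })

df.timestamp = pd.to_datetime(df.timestamp)
df = df.set_index(["timestamp"])
# Resample the numeric column only. closed/label="right" reproduces the
# right-labelled bins of the output shown; current pandas defaults to "left".
df = df[["Values"]].resample("5Min", closed="right", label="right").mean()
print(df.mean())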

This would print the mean you expect:

>>> 
Values    2.75

And the resulting dataframe would be:

>>> df
                     Values
timestamp                  
2013-03-01 08:05:00     2.5
2013-03-01 08:10:00     3.0

Grouping by kind

If you want to group by kind and get the mean of each Kind (that is, A and B), you can do the following:

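df.timestamp = pd.to_datetime(df.timestamp)
df = df.set_index(["timestamp"])
gb = df.groupby(["Kind"])
# closed/label="right" reproduces the right-labelled bins of the output shown.
df = gb.resample("5Min", closed="right", label="right")[["Values"]].mean()
print(df.xs("A", level="Kind").mean())
print(df.xs("B", level="Kind").mean())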

As a result you would get:

>>> 
Values    2.666667
Values    2.625

And your dataframe would finally look like this:

>>> df
                            Values
Kind timestamp                    
A    2013-03-01 08:05:00  2.666667
B    2013-03-01 08:05:00  2.250000
     2013-03-01 08:10:00  3.000000
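The same per-kind means can also be obtained in one groupby call by grouping on Kind and on a time bin together. The sketch below swaps gb.resample for pd.Grouper. It assumes a datetime index, and it sets closed and label to "right" only so that the bins match the output above.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2013-03-01 08:01:00", "2013-03-01 08:02:00",
        "2013-03-01 08:03:00", "2013-03-01 08:04:00",
        "2013-03-01 08:05:00", "2013-03-01 08:06:00",
    ]),
    "Kind": ["A", "B", "A", "B", "A", "B"],
    "Values": [1, 1.5, 2, 3, 5, 3],
}).set_index("timestamp")

# Group by Kind and by 5-minute bins of the index in a single step.
# closed/label="right" is an assumption made to match the output above.
bins = pd.Grouper(freq="5min", closed="right", label="right")
result = df.groupby(["Kind", bins])["Values"].mean()
print(result)
```

The result is a Series with a (Kind, timestamp) MultiIndex; calling result.unstack("Kind") would turn it into one column per kind.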

First, it is better practice to explicitly convert the 'timestamp' column to datetime type:

df = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2013-03-01 08:01:00', '2013-03-01 08:02:00',
        '2013-03-01 08:03:00', '2013-03-01 08:04:00',
        '2013-03-01 08:05:00', '2013-03-01 08:06:00']),
    'Kind':   ['A', 'B', 'A', 'B', 'A', 'B'],
    'Values': [ 1,  4.5,  2,   7,   5,   9] })

Please note the changed values for kind B. When you resample, mean() computes each new value as the average of the existing observations that fall into its bin. If the new frequency is higher than the spacing of the data, some new data points fall between existing ones, and pandas fills their values with NaNs. You can use ffill() or bfill() instead, depending on which side of the time interval you wish to be closed. By default it is the left side, so bfill() is the choice.

 df.set_index('timestamp').groupby('Kind').resample('1.5Min')['Values'].bfill().reset_index()

Out[1]:

  Kind           timestamp  Values
0    A 2013-03-01 08:00:00     1.0
1    A 2013-03-01 08:01:30     2.0
2    A 2013-03-01 08:03:00     2.0
3    A 2013-03-01 08:04:30     5.0
4    B 2013-03-01 08:01:30     4.5
5    B 2013-03-01 08:03:00     7.0
6    B 2013-03-01 08:04:30     9.0
7    B 2013-03-01 08:06:00     9.0

bfill() uses the next observed value to fill the NaNs; ffill() would use the previous one.
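To make the direction of the two fills concrete, here is a small sketch on the kind A values alone. The Series s and the frame filled are illustrative names that do not appear in the answer.

```python
import pandas as pd

# The three kind "A" observations from the example above.
s = pd.Series(
    [1.0, 2.0, 5.0],
    index=pd.to_datetime([
        "2013-03-01 08:01:00",
        "2013-03-01 08:03:00",
        "2013-03-01 08:05:00",
    ]),
)

r = s.resample("1.5min")  # bins at 08:00:00, 08:01:30, 08:03:00, 08:04:30

filled = pd.DataFrame({
    "bfill": r.bfill(),  # each bin takes the next observation at or after it
    "ffill": r.ffill(),  # each bin takes the last observation at or before it
})
print(filled)
```

With ffill() the first bin (08:00:00) stays NaN because no observation precedes it, whereas bfill() fills every bin in this example.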

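If you wish to interpolate the values, and not just fill the gaps, chain transform(pd.Series.interpolate) after mean(). Strictly speaking, transform runs here on the whole resampled Values column rather than on each group separately, but resample only creates bins between a group's first and last observation, so every gap lies inside one group and the result is the same. Try resampling at a higher frequency (say 10 seconds) and you will see a big difference between the two approaches.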

df = df.set_index('timestamp').groupby('Kind').resample('1.5Min')[['Values']].mean().transform(pd.Series.interpolate).reset_index()

Out[2]:

  Kind           timestamp  Values
0    A 2013-03-01 08:00:00     1.0
1    A 2013-03-01 08:01:30     1.5
2    A 2013-03-01 08:03:00     2.0
3    A 2013-03-01 08:04:30     5.0
4    B 2013-03-01 08:01:30     4.5
5    B 2013-03-01 08:03:00     7.0
6    B 2013-03-01 08:04:30     8.0
7    B 2013-03-01 08:06:00     9.0
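The 10-second experiment suggested above could look roughly like this. It is a sketch that interpolates each kind explicitly with a small helper and pd.concat instead of the transform chain; interpolate_kind, parts and result are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2013-03-01 08:01:00", "2013-03-01 08:02:00",
        "2013-03-01 08:03:00", "2013-03-01 08:04:00",
        "2013-03-01 08:05:00", "2013-03-01 08:06:00",
    ]),
    "Kind": ["A", "B", "A", "B", "A", "B"],
    "Values": [1, 4.5, 2, 7, 5, 9],
}).set_index("timestamp")


def interpolate_kind(values: pd.Series) -> pd.Series:
    """Upsample one kind to 10-second steps and fill the gaps linearly."""
    return values.resample("10s").mean().interpolate()


# Hypothetical helper dict: one interpolated Series per kind.
parts = {
    kind: interpolate_kind(part["Values"])
    for kind, part in df.groupby("Kind")
}

# Stack the per-kind results; the dict keys become the outer index level.
result = pd.concat(parts)
print(result.loc["A"].head(8))
```

Because the 10-second grid is evenly spaced, the default linear interpolation here is also linear in time; bfill() on the same grid would instead repeat the next observation across each gap.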

df = df.set_index('timestamp') # Set your index.
df.index = pd.to_datetime(df.index) # Convert to DatetimeIndex (a plain Index doesn't work with resample)
df.resample('5Min').mean(numeric_only=True) # Do the actual resampling on the numeric columns.

This returns a dataframe with 2 rows as you would expect:

                    Values
timestamp                  
2013-03-01 08:00:00   1.875
2013-03-01 08:05:00   4.000

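Your "Kind" column is dropped because it doesn't make sense to take the mean of characters. Older pandas versions drop it silently; newer ones need numeric_only=True or an explicit column selection, otherwise mean() raises a TypeError. If you want to keep it, you have to introduce a new rule (for example, assign the most frequent character in the given period).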
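One way to express such a rule is to pass a per-column mapping to agg(). The sketch below keeps the most frequent Kind in each bin; most_frequent is a hypothetical helper, and ties are resolved by taking the first value that Series.mode() returns.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2013-03-01 08:01:00", "2013-03-01 08:02:00",
        "2013-03-01 08:03:00", "2013-03-01 08:04:00",
        "2013-03-01 08:05:00", "2013-03-01 08:06:00",
    ]),
    "Kind": ["A", "B", "A", "B", "A", "B"],
    "Values": [1, 1.5, 2, 3, 5, 3],
}).set_index("timestamp")


def most_frequent(kinds: pd.Series):
    """Return the most common value in a bin, or None if the bin is empty."""
    modes = kinds.mode()  # all tied values, sorted
    return modes.iloc[0] if not modes.empty else None


# Average the numeric column and apply the custom rule to the text column.
result = df.resample("5min").agg({"Values": "mean", "Kind": most_frequent})
print(result)
```

In this data A and B are tied in both bins, so the rule returns "A" each time; a different tie-break would need a different helper.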

Set timestamp to type datetime and then use it as the index.

df.timestamp = pd.to_datetime(df.timestamp)
df = df.set_index(["timestamp"])

Sample from the rows of your own choice, e.g. sample from kind A:

df[df.Kind=='A'].sample(1)

                    Kind  Values
timestamp                       
2013-03-01 08:03:00    A     2.0

Sample, then do the calculation (the rows are drawn at random, so the result varies between runs):

df[df.Kind=='A'].sample(2).mean(numeric_only=True)
Values    1.5
dtype: float64
