简体   繁体   中英

pandas filter out entire days of dataset based on value

If I have some made up data:

import pandas as pd 
import numpy as np 
from numpy.random import randint


np.random.seed(10)  # added for reproductibility                                                                                                                                                                 

import numpy as np
import pandas as pd
np.random.seed(11)

rows,cols = 50000,2
data = np.random.rand(rows,cols) 
tidx = pd.date_range('2019-01-01', periods=rows, freq='T') 
df = pd.DataFrame(data, columns=['Temperature','Value'], index=tidx)

mediany= df.Value.median()

How could I filter the df based on filtering out entire days of data? For example for each day of data, if the average of Value is less than the entire average Value for entire dataset ( mediany ), discard this day..

I think I can filter out all data that is less than the average but this doesn't preserve complete days of data which I need.

df = df[(df[['Value']] >= mediany).all(axis=1)]

df

Any tips greatly appreciated hopefully this makes sense!

You can use groupby().transform :

s = (df['Value'].ge(mediany)            # compare to mediany
        .groupby(df.index.normalize())  # groupby day 
        .transform('any')               # any time with value larger than median
    )

df[s]

P/S: the entire average Value for entire dataset (mediany) , median is not average:-)

You can try the following code. I have added comments:-

import pandas as pd 
import numpy as np 
from numpy.random import randint


np.random.seed(10)  # added for reproductibility                                                                                                                                                                 

import numpy as np
import pandas as pd
np.random.seed(11)

rows,cols = 50000,2
data = np.random.rand(rows,cols) 
tidx = pd.date_range('2019-01-01', periods=rows, freq='T') 
df = pd.DataFrame(data, columns=['Temperature','Value'], index=tidx)
df["Date"] = df.index

#calculate day wise mean

def calculate_mean(x):
    return np.mean(x)

df_day_mean = df.groupby(df.index).agg({
        'Value': [('Value', calculate_mean)]
    })
df_day_mean.columns = df_day_mean.columns.droplevel()


#calculate whole mean

mean = df.Value.mean()

#get the days where average value is less than whole mean

df_to_discard = df_day_mean[df_day_mean["Value"]<mean]
index_to_discard = df_to_discard.index

# drop these indices from the original df

filtered_df = df.drop(index_to_discard)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM