How to find the first occurrence of a boolean value for a given day using pandas?

(Not duplicate / I did my research)

My minute-based dataframe looks like this:

time,                  price_bool,    price_date
2017-01-01 00:00:00,   False, 
2017-01-01 00:01:00,   False, 
2017-01-01 00:02:00,   True,          2017-01-01 00:02:00
2017-01-01 00:03:00,   False, 
2017-01-01 00:04:00,   False, 
2017-01-01 00:05:00,   True,          2017-01-01 00:05:00
....

Right now it is a minute-based dataset. For each day, I want to keep only the first occurrence of True and then move on to the next day. If a day has no True at all, that day should have 0 in price_date.

My new dataframe should look like this:

time,                  price_bool,    price_date
2017-01-01 00:00:00,   True,          2017-01-01 00:02:00
2017-01-02 00:00:00,   True,          2017-01-02 00:07:00
2017-01-03 00:00:00,   True,          2017-01-03 02:21:00
2017-01-04 00:00:00,   True,          2017-01-04 01:17:00
....

This is the day-based dataset, where price_bool is True and price_date is the corresponding time at which it was first True on that day.

What did I do?

  • First I tried to remove the empty field
  • After that, I tried to groupby('time')

However, it does not work.

Simpler starting data:

import numpy as np
import pandas as pd

# pd.np was removed from pandas, so use numpy's nan directly.
df = pd.DataFrame([
    ["2017-01-01 00:00:00",False,np.nan], 
    ["2017-01-01 00:00:01",True,"2017-01-01 00:00:01"], 
    ["2017-01-01 00:00:02",True,"2017-01-01 00:00:01"],
    ["2017-01-02 00:00:00",False,np.nan], 
], columns=['time','price_bool','price_date'])
df['time'] = pd.to_datetime(df['time'])

This should get you the data you show in your result (note this assumes you're already sorted in chronological order):

res = df[df['price_bool'] == True].groupby(df['time'].dt.date)[['price_bool','price_date']].first().reset_index()

However, I think you're saying that you want to keep dates with price_bool false and have the price_date be 0 in that case. So you would need to add back the dates that are missing in res above. Here's one option:

# Get the True data set right.
res = df[df['price_bool'] == True].groupby(df['time'].dt.date)[['price_bool','price_date']].first()
# Add back the missing dates with only False values
res = res.reindex(df['time'].dt.date.unique()).reset_index()
# Fill in the null values.
res = res.fillna({'price_bool':False, 'price_date':0})

Out (note I created a simpler starting data set):

        time    price_bool  price_date
0   2017-01-01  True    2017-01-01 00:00:01
1   2017-01-02  False   0

df.sort_values(['price_bool', 'time'], ascending=[False, True]).groupby(df['time'].dt.date).first()

Output with your provided df:

>>> df
time        price_bool
2017-01-01  True

Explanation: you want to sort by two columns, price_bool and time. price_bool needs to be sorted in descending order so that True appears before False. Then, since groupby preserves the order of rows within each group, you can simply take the first row of each group after grouping by date.
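
A minimal self-contained sketch of the sort-then-first idea described above. The sample rows are hypothetical, modelled on the simpler starting data. It passes both sort keys to a single sort_values call, which is one way to make sure the time order is kept among rows with the same price_bool. Note that first() returns the first non-null value per column, so this assumes price_date is filled whenever price_bool is True.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, modelled on the simpler starting data above.
df = pd.DataFrame(
    [
        ["2017-01-01 00:00:00", False, np.nan],
        ["2017-01-01 00:00:01", True, "2017-01-01 00:00:01"],
        ["2017-01-01 00:00:02", True, "2017-01-01 00:00:02"],
        ["2017-01-02 00:00:00", False, np.nan],
    ],
    columns=["time", "price_bool", "price_date"],
)
df["time"] = pd.to_datetime(df["time"])

# Sort True before False, and earlier times first within each value.
ordered = df.sort_values(["price_bool", "time"], ascending=[False, True])

# groupby keeps the row order inside each group, so first() picks the
# earliest True row of each day (or a False row if the day has no True).
daily = ordered.groupby(ordered["time"].dt.date)[["price_bool", "price_date"]].first()
print(daily)
```

Days without any True keep price_bool False and a missing price_date, which can then be filled with 0 if that is what you need.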

IIUC:

first_true_daily = df.groupby(pd.Grouper(key='time', freq='D'))['price_bool'].idxmax()

df.loc[first_true_daily]
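
One thing to watch with the idxmax approach: for a day where price_bool is never True, idxmax still returns the first row of that day, so the result contains a False row for it. The sketch below uses hypothetical sample data and shows one way to drop those rows afterwards. The cast to int is only a precaution on my part, and it assumes every day in the range has at least one row (I have not checked how empty daily bins behave across pandas versions).

```python
import pandas as pd

# Hypothetical sample data: day 1 has a True, day 2 has none.
df = pd.DataFrame(
    {
        "time": pd.to_datetime(
            [
                "2017-01-01 00:00:00",
                "2017-01-01 00:01:00",
                "2017-01-01 00:02:00",
                "2017-01-02 00:00:00",
            ]
        ),
        "price_bool": [False, True, True, False],
    }
)

# Cast to int as a precaution so idxmax compares plain 0/1 values.
flags = df.assign(flag=df["price_bool"].astype(int))

# Row label of the first maximum per calendar day.
first_idx = flags.groupby(pd.Grouper(key="time", freq="D"))["flag"].idxmax()

# Candidate rows, one per day; days with no True yield a False row here.
candidates = df.loc[first_idx]

# Keep only the days that actually had a True.
first_true = candidates[candidates["price_bool"]]
print(first_true)
```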
