简体   繁体   中英

Aggregate hourly data and count duplicates in pandas python

I have a dataframe df that looks like this:

time Hit/Miss
2016-09-29 08:00:00 FN
2016-09-29 08:30:00 FN
2016-09-29 09:45:00 TP
2016-10-05 14:00:00 FP

time is imported straight from a file without being set as an index, and Hit/Miss contains three categorical values FN , TP , and FP .

Now, I'd like to group the data by hour, day, and month and count how many FN , TP , FP 's occurred during the hour. As you can see from the brief example dataset. Now, every hour has Hit/Miss values.

This is what I have so far after grouping by day, which does give me the desired output:

df['time'] = pd.to_datetime(df['time'])
df['year'] = df['time'].dt.year
df['month'] = df['time'].dt.month
df['day'] = df['time'].dt.day

HM_DD = pd.crosstab([df['year'],df['month'],df['day']], df['Hit/Miss'], dropna=True)
HM_MM = pd.crosstab([df['year'],df['month']], df['Hit/Miss'], dropna=True)
HM_YYMM = pd.crosstab(df['year'], df['Hit/Miss'], dropna=True)

HM_DD = HM_DD.reset_index()

And the output looks like the snip shot desired output here: 图片

It does show what I need to see; however, I feel like there's a better way to do it. After all, I need to visualize the data by day and year (extra qn if you have time: What would be the best way/chart to visualize it?), so I'd like to keep the time intact for better maneuver later on, not like how I separated them into the year, month, day. I have tried setting time as an index and resample() , but that didn't work for me.

I would appreciate the help!

You can use offsets to aggregate your data:

HM_DD = pd.crosstab(df['time'].dt.normalize(), df['Hit/Miss'])
HM_MM = pd.crosstab(df['time'].dt.date+pd.offsets.MonthBegin(-1), df['Hit/Miss'])
HM_YYMM = pd.crosstab(df['time'].dt.date+pd.offsets.YearBegin(-1), df['Hit/Miss'])

Output:

>>> HM_DD.reset_index().rename_axis(columns=None)
        time  FN  FP  TP
0 2016-09-29   2   0   1
1 2016-10-05   0   1   0

>>> HM_MM.reset_index().rename_axis(columns=None)
        time  FN  FP  TP
0 2016-09-01   2   0   1
1 2016-10-01   0   1   0

>>> HM_YYMM.reset_index().rename_axis(columns=None)
        time  FN  FP  TP
0 2016-01-01   2   1   1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM