Why are there different results for pandas groupby+resample on an appended dataframe?

I want to group and resample a dataframe I have. I group by int_var and bool_var, and then I resample at 1Min frequency to fill in any missing minutes in the dataset. This works perfectly fine for the base dataframe A:

date                  bool_var    int_var   
2021-01-01 00:03:00   True        1
2021-01-01 00:06:00   False       6
2021-01-01 00:06:00   True        6    
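
The question does not show the command itself. As a rough sketch, assuming the goal is a row count per group and per minute, it could look like the following. The use of size() and of the on="date" argument are assumptions, not taken from the post.

```python
import pandas as pd

# Dataframe A as listed in the question.
A = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-01-01 00:03:00",
        "2021-01-01 00:06:00",
        "2021-01-01 00:06:00",
    ]),
    "bool_var": [True, False, True],
    "int_var": [1, 6, 6],
})

# Assumed command: group, resample each group per minute, count rows per bin.
counts = (
    A.groupby(["int_var", "bool_var"])
     .resample("1min", on="date")
     .size()
)
print(counts)
```

Each group is resampled only between its own first and last timestamp, which is why the starting minute matters later on.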

The result then becomes something like this:

int_var  bool_var  date                
1        True      2021-01-01 00:03:00  1
                   2021-01-01 00:04:00  0
                   2021-01-01 00:05:00  0
                   2021-01-01 00:06:00  0

6        True      2021-01-01 00:03:00  0
                   2021-01-01 00:04:00  0
                   2021-01-01 00:05:00  0
                   2021-01-01 00:06:00  1
6        False     2021-01-01 00:03:00  0
                   2021-01-01 00:04:00  0
                   2021-01-01 00:05:00  0
                   2021-01-01 00:06:00  1

This is exactly what I want. However, as you can see the data starts a bit after midnight, and I want those minutes from midnight to be in there as well. So I append a row for each bool_var / int_var combination at 2021-01-01 00:00:00, to make sure the resampling starts from there.

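import pandas as pd

rows = []
# Assumed loop: one midnight row per (int_var, bool_var) combination present in A.
for int_var, bool_var in A[['int_var', 'bool_var']].drop_duplicates().itertuples(index=False):
    rows.append([pd.Timestamp('2021-01-01 00:00:00'), bool_var, int_var])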

extra_rows_df = pd.DataFrame(rows, columns=['date', 'bool_var', 'int_var'])

B = pd.concat([A, extra_rows_df], ignore_index=True)

The resulting dataframe B appears to be correct, and in the same format as dataframe A:

date                  bool_var    int_var
2021-01-01 00:00:00   True        1   
2021-01-01 00:03:00   True        1
2021-01-01 00:00:00   False       6
2021-01-01 00:06:00   False       6
2021-01-01 00:00:00   True        6   
2021-01-01 00:06:00   True        6   

However, if I run the exact same groupby and resample command on dataframe B, my results look completely different:

date               2021-01-01 00:00:00 ... 2021-12-31 23:59:00
int_var  bool_var  1                   ... 1                
1        True      

6        True      
         False

It is as if each date suddenly became a column instead of being listed under each group.

TL;DR: use stack().

I figured it out. In dataframe A, the bool_var / int_var groups cover different datetime values: here (1, True) starts at 00:03, but some other group, e.g. (2, True), could start with an entry at 01:14. Once I filled out dataframe A so that each group had an entry at 00:00 in dataframe B, and resampled to fill in each minute, every group had exactly the same datetimes. Because the datetimes then apply to every group, they could all become columns instead of an index level.

The solution is to call stack() on this final result, which moves the date columns back into the row index.
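
As an illustration, here is a minimal sketch of what stack() does to a wide result. The frame below is built by hand with made-up counts rather than produced by the real groupby and resample call, and to_long is a hypothetical helper name introduced only for this example.

```python
import pandas as pd

# Hand-built stand-in for the wide result: one row per group, one column per minute.
minutes = pd.date_range("2021-01-01 00:00", periods=3, freq="min", name="date")
groups = pd.MultiIndex.from_tuples(
    [(1, True), (6, True), (6, False)], names=["int_var", "bool_var"]
)
wide = pd.DataFrame(
    [[1, 0, 0], [1, 0, 1], [1, 1, 0]],  # made-up counts
    index=groups,
    columns=minutes,
)

def to_long(result):
    """Hypothetical helper: stack a wide DataFrame, pass a long Series through."""
    return result.stack() if isinstance(result, pd.DataFrame) else result

# The date columns become the innermost index level again.
long = to_long(wide)
print(long)
```

The isinstance guard means the same call can be used whether the groupby and resample step came back wide or already long, so the rest of the pipeline sees one layout.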
