
Groupby in pandas returning too many rows

I am trying to filter a dataframe in pandas using the groupby function. The aim is to take the earliest (by date) instance of each variable for each id.

Eventually I was able to solve the problem in R using dplyr, like so:

df_mins <- df %>%
  group_by(id, variable) %>%
  slice(which.min(as.Date(date)))

I also got close in pandas with this:

df.groupby(['id', 'variable'])['date'].transform(min) == df['date']

However, the resulting df had more than one (non-unique) entry per variable. Any ideas what I'm doing wrong?

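This happens because some groups have more than one row with the minimum date, and the comparison marks every one of them as True. Filter with the mask, then keep a single row per group with drop_duplicates: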

m = df.groupby(['id', 'variable'])['date'].transform('min') == df['date']
df = df[m].drop_duplicates(['id', 'variable'])
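
To make the tie behaviour concrete, here is a minimal sketch on made-up data. The sample frame, the `value` column, and the names `tied` and `first` are assumptions for illustration; only `id`, `variable`, and `date` come from the question.

```python
import pandas as pd

# Hypothetical sample data: (id=1, variable="a") has two rows on its earliest date.
df = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "variable": ["a", "a", "a", "b"],
    "date": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-02-01", "2020-03-01"]
    ),
    "value": [10, 11, 12, 13],
})

# Boolean mask: True for every row whose date equals its group's minimum.
m = df.groupby(["id", "variable"])["date"].transform("min") == df["date"]

# Filtering with the mask keeps BOTH tied rows for (1, "a").
tied = df[m]

# Dropping duplicates on the group keys keeps the first matching row per group.
first = tied.drop_duplicates(["id", "variable"])

print(len(tied), len(first))
```

With ties present, the mask alone returns one row per tied minimum, which is why the filtered frame had more rows than groups.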

In R, we can also sort by date and then drop the duplicated groups:

df = df[order(df$date), ]
df = df[!duplicated(df[c('id', 'variable')]), ]

The same in pandas:

df=df.sort_values(['date']).drop_duplicates(['id', 'variable'])
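
As a sketch of an alternative that is closer to R's `slice(which.min(...))`, `groupby(...).idxmin()` returns the index label of the first minimum in each group, which can be passed to `.loc`. The sample data and the `value` column below are assumptions. The question's R code calls `as.Date(date)`, so the dates may be stored as strings; if so, converting with `pd.to_datetime` first avoids sorting them as text.

```python
import pandas as pd

# Hypothetical sample data with dates stored as strings.
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "variable": ["a", "a", "b", "b"],
    "date": ["2020-02-01", "2020-01-01", "2020-03-01", "2020-03-01"],
    "value": [10, 11, 12, 13],
})

# Convert to real datetimes so comparisons are chronological, not lexical.
df["date"] = pd.to_datetime(df["date"])

# idxmin gives the index label of the first minimum date in each group.
idx = df.groupby(["id", "variable"])["date"].idxmin()
by_idxmin = df.loc[idx]

# Sort-then-deduplicate; a stable sort keeps tied rows in their original order.
by_sort = (
    df.sort_values("date", kind="mergesort")
      .drop_duplicates(["id", "variable"])
)

print(by_idxmin)
print(by_sort)
```

Both approaches return exactly one row per (id, variable) pair. They can differ in which tied row survives unless the sort is stable, so `kind="mergesort"` is used here to make the choice deterministic.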

