Groupby in pandas returning too many rows

Question

I am trying to filter a dataframe in pandas using the groupby function. The aim is to take the earliest (by date) instance of each variable for each id.

Eventually I was able to solve the problem in R using dplyr like so:

df_mins <- df %>%
  group_by(id, variable) %>%
  slice(which.min(as.Date(date)))

I also achieved something close using pandas which looked like this:

df.groupby(['id', 'variable'])['date'].transform(min) == df['date']

However, the resulting df had more than one (non-unique) entry per variable. Any ideas what I'm doing wrong?

Answer 1

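Since you have duplicates for the min date within a group, the comparison is True for every tied row. Filter with the mask, then drop the extra rows: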

m = df.groupby(['id', 'variable'])['date'].transform('min') == df['date']
df = df[m].drop_duplicates(['id', 'variable'])
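To see why the mask alone returns too many rows, here is a minimal sketch on made-up data. The frame, its `value` column, and the names `mask`, `tied`, and `first_only` are assumptions for illustration, not taken from the question.

```python
import pandas as pd

# Hypothetical sample: id 1 / variable "a" has two rows sharing the earliest date.
df = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "variable": ["a", "a", "a", "b"],
    "date": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-02-01", "2020-03-01"]
    ),
    "value": [10, 11, 12, 13],
})

# True for every row whose date equals its group's minimum, ties included.
mask = df.groupby(["id", "variable"])["date"].transform("min") == df["date"]

# Both tied rows for (1, "a") survive the filter, plus the single row for (2, "b").
tied = df[mask]

# Keep only the first surviving row per (id, variable) pair.
first_only = tied.drop_duplicates(["id", "variable"])
```

Here `tied` still holds two rows for the pair (1, "a"), which is the behaviour described in the question; `first_only` is down to one row per pair.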

Also, in R we can sort by date and keep the first row per group:

df=df[order(df$date),]

df=df[!duplicated(df[c('id', 'variable')]),]

The same in pandas:

df = df.sort_values('date').drop_duplicates(['id', 'variable'])
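As a rough check of the sort-then-deduplicate approach, the sketch below runs it on made-up data next to a `groupby(...).idxmin()` lookup, which is the closest pandas analogue I know of to R's `which.min`. The sample frame and the names `by_sort` and `by_idxmin` are assumptions for illustration. It also assumes the date column is parsed with `pd.to_datetime`, as `as.Date` does in the R version, and passes `kind="stable"` so that tied dates keep their original order.

```python
import pandas as pd

# Hypothetical sample data; dates start as strings and are parsed explicitly.
df = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "variable": ["a", "a", "a", "b"],
    "date": ["2020-02-01", "2020-01-01", "2020-01-01", "2020-03-01"],
    "value": [12, 10, 11, 13],
})
df["date"] = pd.to_datetime(df["date"])

# Sort by date (stable, so ties keep their original order), then keep the
# first row seen for each (id, variable) pair.
by_sort = df.sort_values("date", kind="stable").drop_duplicates(["id", "variable"])

# Alternative: idxmin gives the index label of the first minimum date per
# group; .loc then pulls exactly one full row per group.
by_idxmin = df.loc[df.groupby(["id", "variable"])["date"].idxmin()]
```

Both results contain one row per (id, variable) pair. Without `kind="stable"`, which of several tied rows survives the sort-based version is not guaranteed.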
