I am trying to filter a DataFrame in pandas using the groupby function. The aim is to keep the earliest (by date) instance of each variable for each id.
Eventually I was able to solve the problem in R using dplyr like so:
df_mins <- df %>%
group_by(id, variable) %>%
slice(which.min(as.Date(date)))
I also achieved something close using pandas which looked like this:
df.groupby(['id', 'variable'])['date'].transform(min) == df['date']
However, the resulting df had more than one (non-unique) entry per variable. Any ideas what I'm doing wrong?
Several rows in a group can share the minimum date, and the mask keeps all of them, so drop the duplicates afterwards:
m = df.groupby(['id', 'variable'])['date'].transform('min') == df['date']
df = df[m].drop_duplicates(['id', 'variable'])
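To make the tie visible, here is a minimal sketch of the mask approach on made-up data. The sample frame and the `value` column are assumptions for illustration only, not the asker's data.

```python
import pandas as pd

# Hypothetical sample data: id 1 / variable "a" has two rows tied on the earliest date.
df = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "variable": ["a", "a", "a", "b"],
    "date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-02-01", "2020-03-01"]),
    "value": [10, 20, 30, 40],
})

# True for every row whose date equals the minimum date of its (id, variable) group.
m = df.groupby(["id", "variable"])["date"].transform("min") == df["date"]

# The mask alone keeps both tied rows for (1, "a").
tied = df[m]

# drop_duplicates keeps only the first remaining row per (id, variable).
first_only = tied.drop_duplicates(["id", "variable"])

print(tied)
print(first_only)
```

With ties, which row survives depends on row order, since drop_duplicates keeps the first occurrence by default.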
In R you can also sort by date and keep the first row of each group:
df=df[order(df$date),]
df=df[!duplicated(df[c('id', 'variable')]),]
The same in pandas:
df=df.sort_values(['date']).drop_duplicates(['id', 'variable'])
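A small sketch of the sort-then-deduplicate approach, under two assumptions that are not in the original post: the dates arrive as strings and are converted with pd.to_datetime so the sort is chronological, and a stable sort is requested so tied dates keep their original order. The idxmin line is an alternative technique, shown only as a cross-check; it is not part of the answer above.

```python
import pandas as pd

# Hypothetical sample data with dates stored as strings.
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "variable": ["a", "a", "b", "b"],
    "date": ["2020-02-01", "2020-01-15", "2019-12-31", "2020-01-01"],
})

# Convert to datetimes so that sorting is chronological.
df["date"] = pd.to_datetime(df["date"])

# Sort by date (mergesort is stable), keep the first row per (id, variable),
# then restore the original row order.
earliest = (
    df.sort_values("date", kind="mergesort")
      .drop_duplicates(["id", "variable"])
      .sort_index()
)

# Alternative cross-check: pick the row label of the minimum date in each group.
via_idxmin = df.loc[df.groupby(["id", "variable"])["date"].idxmin()]

print(earliest)
print(via_idxmin)
```

Both results hold one row per (id, variable) pair. The final sort_index call is optional and only restores the original row order.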