How to remove duplicated rows based on values in column
Good morning guys. My problem is removing duplicates from a dataframe that are caused by many different values in one of the columns.
The base dataframe looks like this below:
As you can see, I have duplicated values in the Name and Id columns, depending on Category. The goal is to remove those duplicates while keeping the information about the category.
I would like to have the exact view as here below:
I have tried to use the get_dummies method from the pandas library, but I have some issues.
dummies = pd.get_dummies(df[['Category']], drop_first=True)
df = pd.concat([df.drop(['Category'], axis=1), dummies], axis=1)
Using the code above I'm getting a result like this below:
The result is basically still the same as the base dataframe.
Do you guys have any idea how to deal with it?
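A minimal sketch of why the attempt above leaves the duplicates in place. The `sample` dataframe is made up for illustration (it is not the data from the question). get_dummies encodes each row on its own, so the row count does not change, and drop_first=True additionally drops the first category column.

```python
import pandas as pd

# Hypothetical sample data, assumed for illustration only.
sample = pd.DataFrame({
    'Name': ['ABC', 'ABC', 'ABC', 'DEF', 'GHI', 'JKL', 'JKL', 'MNO'],
    'Id': [1, 1, 2, 2, 3, 4, 4, 5],
    'Category': ['A', 'A', 'B', 'A', 'C', 'C', 'C', 'A'],
})

# Same steps as in the question: encode Category, then glue it back on.
dummies = pd.get_dummies(sample[['Category']], drop_first=True)
encoded = pd.concat([sample.drop(['Category'], axis=1), dummies], axis=1)

# One output row per input row: nothing has been collapsed yet.
print(len(sample), len(encoded))
# Category_A is missing because drop_first=True removed it.
print(encoded.columns.tolist())
```

So the encoding step alone is not enough; the rows still have to be grouped and aggregated afterwards.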
It depends on what you need. If duplicates per Name and Id are possible, aggregate with max:
# dtype=int keeps 0/1 columns; since pandas 2.0 the default dtype is bool
df = (pd.get_dummies(df, columns=['Category'], dtype=int)
        .groupby(['Name', 'Id'], as_index=False)
        .max())
print(df)
  Name  Id  Category_A  Category_B  Category_C
0  ABC   1           1           0           0
1  ABC   2           0           1           0
2  DEF   2           1           0           0
3  GHI   3           0           0           1
4  JKL   4           0           0           1
5  MNO   5           1           0           0
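For anyone who wants to try this end to end, here is a self-contained sketch of the same approach. The `sample` dataframe is an assumption (the question's data is not reproduced here); it contains exact duplicate rows so the collapsing is visible.

```python
import pandas as pd

# Hypothetical sample data with repeated (Name, Id, Category) rows.
sample = pd.DataFrame({
    'Name': ['ABC', 'ABC', 'ABC', 'DEF', 'GHI', 'JKL', 'JKL', 'MNO'],
    'Id': [1, 1, 2, 2, 3, 4, 4, 5],
    'Category': ['A', 'A', 'B', 'A', 'C', 'C', 'C', 'A'],
})

out = (
    # One 0/1 indicator column per category value.
    pd.get_dummies(sample, columns=['Category'], dtype=int)
    # One output row per (Name, Id) pair.
    .groupby(['Name', 'Id'], as_index=False)
    # max() acts as a logical OR over the 0/1 indicators in each group.
    .max()
)
print(out)
```

Taking max works here because each indicator is 0 or 1, so a group gets 1 in a column as soon as any of its rows had that category.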
If you need to aggregate per Id only, taking the last value for non-numeric columns, use:
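import numpy as np

# max for numeric columns, last value for everything else (e.g. Name)
f = lambda x: x.max() if np.issubdtype(x.dtype, np.number) else x.iat[-1]
# dtype=int is needed on pandas 2.x: bool dummies are not a subtype of np.number
df = (pd.get_dummies(df, columns=['Category'], dtype=int)
        .groupby('Id', as_index=False)
        .agg(f))
print(df)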
   Id Name  Category_A  Category_B  Category_C
0   1  ABC           1           0           0
1   2  DEF           1           1           0
2   3  GHI           0           0           1
3   4  JKL           0           0           1
4   5  MNO           1           0           0
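A runnable sketch of the per-Id aggregation, again on made-up data (`sample` and `max_or_last` are hypothetical names). As a swapped-in detail, it checks the dtype with pd.api.types.is_numeric_dtype instead of np.issubdtype, on the assumption that this is friendlier to pandas extension dtypes such as the string dtype.

```python
import pandas as pd

# Hypothetical sample data: Id 2 is shared by two different names.
sample = pd.DataFrame({
    'Name': ['ABC', 'ABC', 'ABC', 'DEF', 'GHI', 'JKL', 'JKL', 'MNO'],
    'Id': [1, 1, 2, 2, 3, 4, 4, 5],
    'Category': ['A', 'A', 'B', 'A', 'C', 'C', 'C', 'A'],
})

def max_or_last(col):
    """Return the max of a numeric column, else its last value."""
    if pd.api.types.is_numeric_dtype(col):
        return col.max()
    return col.iat[-1]

out = (
    pd.get_dummies(sample, columns=['Category'], dtype=int)
    .groupby('Id', as_index=False)
    .agg(max_or_last)
)
print(out)
```

Note that for Id 2 the Name column keeps only the last name seen, while both Category_A and Category_B end up set, so information about the other name is lost by design.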
In the second solution it is also possible to specify the aggregation for each column:
# f = lambda x: x.max() if np.issubdtype(x.dtype, np.number) else x.iat[-1]
# dtype=int keeps 0/1 columns; since pandas 2.0 the default dtype is bool
df = pd.get_dummies(df, columns=['Category'], dtype=int)
# default every column to 'max', then override Name with 'last'
d = dict.fromkeys(df.columns, 'max')
d['Name'] = 'last'
print(d)
{'Name': 'last', 'Id': 'max', 'Category_A': 'max', 'Category_B': 'max', 'Category_C': 'max'}
df = df.groupby('Id', as_index=False).agg(d)
print(df)
  Name  Id  Category_A  Category_B  Category_C
0  ABC   1           1           0           0
1  DEF   2           1           1           0
2  GHI   3           0           0           1
3  JKL   4           0           0           1
4  MNO   5           1           0           0
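A small variant sketch of the dictionary approach that leaves the grouping column Id out of the mapping. It assumes the dummy columns are exactly those starting with `Category_`; `sample`, `dummy_cols` and `agg_map` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical sample data, assumed for illustration only.
sample = pd.DataFrame({
    'Name': ['ABC', 'ABC', 'ABC', 'DEF', 'GHI', 'JKL', 'JKL', 'MNO'],
    'Id': [1, 1, 2, 2, 3, 4, 4, 5],
    'Category': ['A', 'A', 'B', 'A', 'C', 'C', 'C', 'A'],
})

encoded = pd.get_dummies(sample, columns=['Category'], dtype=int)

# Pick the indicator columns by prefix (assumption: no other column uses it).
dummy_cols = [c for c in encoded.columns if c.startswith('Category_')]

# 'last' for Name, 'max' for every indicator; Id is the group key, so it is omitted.
agg_map = {'Name': 'last', **{c: 'max' for c in dummy_cols}}

out = encoded.groupby('Id', as_index=False).agg(agg_map)
print(out)
```

With this mapping the result has Id first, followed by the columns in the order they appear in `agg_map`.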