How to remove duplicate rows based on the values in a column

Question

Good morning, everyone. My problem is removing duplicates from a dataframe that are caused by many different values in one of its columns.

The base dataframe looks like this:

As you can see, I have duplicated values in the Name and Id columns, depending on Category. Our goal is to remove those duplicates while keeping the information about the category.

I would like to get exactly the view shown below:

I have tried to use the get_dummies method from the pandas library, but I have run into some issues.

dummies = pd.get_dummies(df[['Category']], drop_first=True)
df = pd.concat([df.drop(['Category'], axis=1), dummies], axis=1)

Using the code above, I get the result below:

The result is basically still the same as the base dataframe.

Do you have any idea how to deal with this?

Answer 1

It depends on what you need. If there can be duplicates per Name and Id, aggregate with max:

# dtype=int keeps the 0/1 output below; newer pandas returns booleans by default
df = (pd.get_dummies(df, columns=['Category'], dtype=int)
        .groupby(['Name', 'Id'], as_index=False)
        .max())
print(df)
  Name  Id  Category_A  Category_B  Category_C
0  ABC   1           1           0           0
1  ABC   2           0           1           0
2  DEF   2           1           0           0
3  GHI   3           0           0           1
4  JKL   4           0           0           1
5  MNO   5           1           0           0
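
As a self-contained illustration of the approach above, here is a minimal sketch. The input frame is hypothetical (the asker's original data is not reproduced on this page), and the names `wide` and `result` are only for illustration. It shows that get_dummies on its own keeps one row per input row, and that the groupby with max is what collapses the repeated Name/Id pairs.

```python
import pandas as pd

# Hypothetical sample data, not the asker's original frame:
# ("ABC", 1) appears twice with two different categories.
df = pd.DataFrame({
    "Name": ["ABC", "ABC", "DEF", "GHI"],
    "Id": [1, 1, 2, 3],
    "Category": ["A", "B", "A", "C"],
})

# One indicator column per category; still one row per input row.
wide = pd.get_dummies(df, columns=["Category"], dtype=int)

# Collapse rows sharing the same Name and Id; max keeps a 1 wherever
# any of the grouped rows had that category.
result = wide.groupby(["Name", "Id"], as_index=False).max()

print(len(wide), len(result))
print(result)
```

In this sample, the two ("ABC", 1) rows end up as a single row with both Category_A and Category_B set.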

If you need to aggregate per Id, taking the last value for non-numeric columns, use:

import numpy as np

# max for numeric columns, last value in the group otherwise
f = lambda x: x.max() if np.issubdtype(x.dtype, np.number) else x.iat[-1]
df = (pd.get_dummies(df, columns=['Category'], dtype=int)
        .groupby('Id', as_index=False)
        .agg(f))
print(df)
   Id Name  Category_A  Category_B  Category_C
0   1  ABC           1           0           0
1   2  DEF           1           1           0
2   3  GHI           0           0           1
3   4  JKL           0           0           1
4   5  MNO           1           0           0
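
The per-Id variant can be sketched the same way. The data below is again hypothetical, and `max_or_last` is an illustrative helper name; it uses `pandas.api.types.is_numeric_dtype` in place of the NumPy dtype check, which is an assumption of this sketch rather than part of the answer above.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical sample data: Id 2 is shared by two different names.
df = pd.DataFrame({
    "Name": ["ABC", "ABC", "DEF", "GHI"],
    "Id": [1, 2, 2, 3],
    "Category": ["A", "B", "A", "C"],
})

def max_or_last(col):
    """Return the max of a numeric column, else the last value in the group."""
    return col.max() if is_numeric_dtype(col) else col.iloc[-1]

wide = pd.get_dummies(df, columns=["Category"], dtype=int)

# The function is applied to each column within each Id group.
result = wide.groupby("Id", as_index=False).agg(max_or_last)
print(result)
```

Note that grouping by Id alone merges rows with different names: here Id 2 keeps the last name seen ("DEF") and carries both of its categories.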

In the second solution, you can also specify the aggregation for each column:

# f = lambda x: x.max() if np.issubdtype(x.dtype, np.number) else x.iat[-1]
df = pd.get_dummies(df, columns=['Category'], dtype=int)
         
d = dict.fromkeys(df.columns, 'max')
d['Name'] = 'last'
print (d)
{'Name': 'last', 'Id': 'max', 'Category_A': 'max', 'Category_B': 'max', 'Category_C': 'max'}

df = df.groupby('Id', as_index=False).agg(d)
print (df)
  Name  Id  Category_A  Category_B  Category_C
0  ABC   1           1           0           0
1  DEF   2           1           1           0
2  GHI   3           0           0           1
3  JKL   4           0           0           1
4  MNO   5           1           0           0
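
A possible variant of the dictionary approach builds the mapping only for the non-key columns, so the grouping column itself is not aggregated. This is a sketch on hypothetical data; `key` and `agg_map` are illustrative names.

```python
import pandas as pd

# Hypothetical sample data: Id 2 is shared by two different names.
df = pd.DataFrame({
    "Name": ["ABC", "ABC", "DEF", "GHI"],
    "Id": [1, 2, 2, 3],
    "Category": ["A", "B", "A", "C"],
})

wide = pd.get_dummies(df, columns=["Category"], dtype=int)

key = "Id"
# Default every non-key column to "max", then override Name with "last".
agg_map = {col: "max" for col in wide.columns if col != key}
agg_map["Name"] = "last"

result = wide.groupby(key, as_index=False).agg(agg_map)
print(result)
```

With this mapping the grouping column comes first in the result, followed by the columns in the order they appear in the dictionary.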
