How to keep only rows which have more than one occurrence of a value in a pandas DataFrame?
I often try to do the following operation, but I am not sure whether there is an immediate, efficient solution in pandas:
I have the following example pandas DataFrame, with two columns, Name and Age:
import pandas as pd
data = [['Alex',10],['Bob',12],['Barbara',25], ['Bob',72], ['Clarke',13], ['Clarke',13], ['Destiny', 45]]
df = pd.DataFrame(data, columns=['Name', 'Age']).astype({'Age': float})
print(df)
Name Age
0 Alex 10.0
1 Bob 12.0
2 Barbara 25.0
3 Bob 72.0
4 Clarke 13.0
5 Clarke 13.0
6 Destiny 45.0
I would like to keep only the rows that have a repeated value in Name (and remove all others). In the example df, there are two Bob values and two Clarke values. The intended output would therefore be:
Name Age
0 Bob 12.0
1 Bob 72.0
2 Clarke 13.0
3 Clarke 13.0
where I'm assuming that the index has been reset.
One option would be to keep all unique values for Name in a list, and then iterate through the DataFrame to check for duplicate rows. That would be very inefficient.
Is there a built-in function for this task?
Use drop_duplicates with keep=False, and then select only the rows that it would drop:
print(df[~df['Name'].isin(df['Name'].drop_duplicates(keep=False))])
Output:
Name Age
1 Bob 12.0
3 Bob 72.0
4 Clarke 13.0
5 Clarke 13.0
If you care about the index, do:
print(df[~df['Name'].isin(df['Name'].drop_duplicates(keep=False))].reset_index(drop=True))
Output:
Name Age
0 Bob 12.0
1 Bob 72.0
2 Clarke 13.0
3 Clarke 13.0
Using duplicated with keep=False:
df[df.Name.duplicated(keep=False)]
Name Age
1 Bob 12.0
3 Bob 72.0
4 Clarke 13.0
5 Clarke 13.0
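Putting this together with the reset index the question asks for, a minimal end-to-end version might look like:

```python
import pandas as pd

data = [['Alex', 10], ['Bob', 12], ['Barbara', 25], ['Bob', 72],
        ['Clarke', 13], ['Clarke', 13], ['Destiny', 45]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

# keep=False marks every occurrence of a repeated name as True,
# so boolean indexing keeps all rows whose name appears more than once.
out = df[df['Name'].duplicated(keep=False)].reset_index(drop=True)
print(out)
```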