Select rows from a Pandas DataFrame with same values in one column but different values in the other column
Say I have the pandas DataFrame below:
     A      B  C   D
1  foo    one  0   0
2  foo    one  2   4
3  foo    two  4   8
4  cat    one  8   4
5  bar   four  6  12
6  bar  three  7  14
7  bar   four  7  14
I would like to select all the rows that have equal values in A but differing values in B. So I would like the output of my code to be:
     A      B  C   D
1  foo    one  0   0
3  foo    two  4   8
5  bar  three  7  14
6  bar   four  7  14
What's the most efficient way to do this? I have approximately 11,000 rows with a lot of variation in the column values, but this situation comes up a lot. In my dataset, if elements in column A are equal then the corresponding column B values should also be equal; however, due to mislabeling this is not the case and I would like to fix it, since doing so one by one would be impractical.
You can try groupby() + filter + drop_duplicates():
>>> df.groupby('A').filter(lambda g: len(g) > 1).drop_duplicates(subset=['A', 'B'], keep="first")
     A      B  C   D
0  foo    one  0   0
2  foo    two  4   8
4  bar   four  6  12
5  bar  three  7  14
Or, in case you only want to drop duplicates over the subset of columns A and B, you can use the line below, but the result will then also keep the cat row:
>>> df.drop_duplicates(subset=['A', 'B'], keep="first")
     A      B  C   D
0  foo    one  0   0
2  foo    two  4   8
3  cat    one  8   4
4  bar   four  6  12
5  bar  three  7  14
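If you want both steps in a single pass, i.e. de-duplicate on (A, B) and also drop rows whose A value occurs only once, one option (a sketch, not part of the original answer) is to combine drop_duplicates with a duplicated(keep=False) mask on A:

```python
import pandas as pd

# Example frame from the question (0-based index, as in the answer above).
df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'cat', 'bar', 'bar', 'bar'],
    'B': ['one', 'one', 'two', 'one', 'four', 'three', 'four'],
    'C': [0, 2, 4, 8, 6, 7, 7],
    'D': [0, 4, 8, 4, 12, 14, 14],
})

# Keep the first row of every (A, B) pair, then drop rows whose A
# value appears only once among the survivors (here, the lone 'cat' row).
deduped = df.drop_duplicates(subset=['A', 'B'], keep='first')
result = deduped[deduped['A'].duplicated(keep=False)]
print(result)
```

This produces the same four rows as the groupby/filter version, without the Python-level lambda.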
Use groupby + filter + head:
result = df.groupby('A').filter(lambda g: len(g) > 1).groupby(['A', 'B']).head(1)
print(result)
Output
     A      B  C   D
0  foo    one  0   0
2  foo    two  4   8
4  bar   four  6  12
5  bar  three  7  14
The first group-by and filter will remove the rows with no duplicated A values (i.e. cat); the second will create groups with the same A, B and take the first element of each.
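As a variation on the above (a sketch, not from either answer), the question's actual criterion, "A groups whose B values are not all identical," can also be expressed as a vectorized mask with transform('nunique'), which avoids the Python-level filter callback and tends to scale better on larger frames:

```python
import pandas as pd

# Example frame from the question (0-based index).
df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'cat', 'bar', 'bar', 'bar'],
    'B': ['one', 'one', 'two', 'one', 'four', 'three', 'four'],
    'C': [0, 2, 4, 8, 6, 7, 7],
    'D': [0, 4, 8, 4, 12, 14, 14],
})

# True for rows whose A group contains more than one distinct B value.
mask = df.groupby('A')['B'].transform('nunique') > 1
result = df[mask].drop_duplicates(subset=['A', 'B'])
print(result)
```

Unlike filtering on group length alone, this mask also excludes A groups that are duplicated but agree on B, which matches the "differing values in B" requirement directly.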
The current answers are correct and may be more sophisticated too. If you have complex criteria, the filter function will be very useful. If you are like me and want to keep things simple, I feel the following is a more beginner-friendly way:
>>> import pandas as pd
>>> df = pd.DataFrame({
'A': ['foo', 'foo', 'foo', 'cat', 'bar', 'bar', 'bar'],
'B': ['one', 'one', 'two', 'one', 'four', 'three', 'four'],
'C': [0,2,4,8,6,7,7],
'D': [0,4,8,4,12,14,14]
}, index=[1,2,3,4,5,6,7])
>>> df = df.drop_duplicates(['A', 'B'], keep='last')
>>> df
     A      B  C   D
2  foo    one  2   4
3  foo    two  4   8
4  cat    one  8   4
6  bar  three  7  14
7  bar   four  7  14
>>> df = df[df.duplicated(['A'], keep=False)]
>>> df
     A      B  C   D
2  foo    one  2   4
3  foo    two  4   8
6  bar  three  7  14
7  bar   four  7  14
keep='last' is optional here.
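Since the question says the differing B values are mislabels that should ultimately be repaired, here is a hedged sketch of one bulk fix: overwrite B with the most common B value within each A group. This assumes the majority label in each group is the correct one, which the question does not guarantee.

```python
import pandas as pd

# Example frame from the question (0-based index).
df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'cat', 'bar', 'bar', 'bar'],
    'B': ['one', 'one', 'two', 'one', 'four', 'three', 'four'],
    'C': [0, 2, 4, 8, 6, 7, 7],
    'D': [0, 4, 8, 4, 12, 14, 14],
})

# Replace every B with the most frequent B in its A group.
# mode() returns sorted values, so ties resolve alphabetically.
df['B'] = df.groupby('A')['B'].transform(lambda s: s.mode().iloc[0])
print(df)
```

After this, every A group has exactly one B value, so the selection in the question would return no rows.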