
Find matching rows based on a conditional grouping in a pandas dataframe

I've looked everywhere for this answer, but none of the ones I found do what I need. Here's a dummy example:

import pandas as pd

data = {'id':[1, 2, 3, 4, 1, 1, 3, 4, 1], 
        'parent':['a', 'b', 'f', 'j', 'a', 'n', 'f', 'z', 'x'], 
        'vehicle':['car', 'car', 'truck', 'suv', 'car', 'hatch', 'truck', 'suv', 'car'], 
        'color':['red', 'blue', 'grey', 'green', 'red', 'purple', 'grey', 'green', 'red'],
        'serial': [324234, 23464, 5667, 1245, 786, 34546, 8537, 111111, 8376251537]}
df = pd.DataFrame(data)
# Sort so rows with the same id and parent sit next to each other
df.sort_values(by=['id', 'parent'], inplace=True)

    id  parent  vehicle   color   serial
0   1   a        car      red     324234
4   1   a        car      red     786
5   1   n        hatch    purple  34546
8   1   x        car      red     8376251537
1   2   b        car      blue    23464
2   3   f        truck    grey    5667
6   3   f        truck    grey    8537
3   4   j        suv      green   1245
7   4   z        suv      green   111111

What I need is to get all rows where the id is the same, the vehicle and color are the same, but the parent differs.


So I want:

    id  parent  vehicle color   serial
0   1   a       car     red     324234
4   1   a       car     red     786
8   1   x       car     red     8376251537
3   4   j       suv     green   1245
7   4   z       suv     green   111111

Note that I want to include the first two rows above because they have different serial numbers. Edit: they are also part of a group that has differing parents with the same id.


I've tried this, and it gets close:

target = df[df.duplicated(['id', 'vehicle', 'color'], keep=False)]

    id  parent  vehicle   color   serial
0   1   a       car       red     324234
4   1   a       car       red     786
8   1   x       car       red     8376251537
2   3   f       truck     grey    5667
6   3   f       truck     grey    8537
3   4   j       suv       green   1245
7   4   z       suv       green   111111

But I don't want the rows that have matching id, vehicle, and color if the corresponding parent is also the same. So in this case, I don't want

    id  parent  vehicle   color   serial
2   3   f       truck     grey    5667
6   3   f       truck     grey    8537

because they have the same parent. I've thought about grouping and changing the index, but what I'm doing isn't working. This seems like an easy problem, and maybe it is, but I just can't crack it!

IIUC, let's try this:

df[df.groupby(['id','vehicle','color'])['parent'].transform('nunique') > 1]

Output:

   id parent vehicle  color      serial
0   1      a     car    red      324234
4   1      a     car    red         786
8   1      x     car    red  8376251537
3   4      j     suv  green        1245
7   4      z     suv  green      111111
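
To see why this works, it may help to split the one-liner into steps. The sketch below rebuilds the question's frame and uses two intermediate names, `parent_counts` and `mask`, which are only illustrative and not part of the original answer. `transform('nunique')` counts the distinct `parent` values in each `(id, vehicle, color)` group and broadcasts that count back onto every row of the group, so it lines up with `df` and can be used as a boolean filter.

```python
import pandas as pd

data = {'id': [1, 2, 3, 4, 1, 1, 3, 4, 1],
        'parent': ['a', 'b', 'f', 'j', 'a', 'n', 'f', 'z', 'x'],
        'vehicle': ['car', 'car', 'truck', 'suv', 'car', 'hatch', 'truck', 'suv', 'car'],
        'color': ['red', 'blue', 'grey', 'green', 'red', 'purple', 'grey', 'green', 'red'],
        'serial': [324234, 23464, 5667, 1245, 786, 34546, 8537, 111111, 8376251537]}
df = pd.DataFrame(data)

# Distinct parents per (id, vehicle, color) group, repeated on each row of that group
parent_counts = df.groupby(['id', 'vehicle', 'color'])['parent'].transform('nunique')

# Keep only rows whose group contains more than one distinct parent
mask = parent_counts > 1
result = df[mask]

print(result)
```

The truck/grey rows for id 3 drop out because their group has a single parent (`f`), while both `a` rows for id 1 stay because their group also contains parent `x`.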


 