Pandas - "select where a condition exists" between two columns of different dataframes

I have two dataframes:

import pandas as pd

data1 = {
    'id': [1,1,2,2],
    'tag': [700,800,700,800],
    'Membership': [1,0.9,0.8,0.7],
}
data2 = {
    'id': [1,2,3,3],
    'tag': [700,800,600,500],
    'Membership': [0.5,0.9,0.8,0.7],
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

Which look like:

>>> df1
      id   tag  Membership
0      1   700         1.0
1      1   800         0.9
2      2   700         0.8
3      2   800         0.7

>>> df2
      id   tag  Membership
0      1   700         0.5
1      2   800         0.9
2      3   600         0.8
3      3   500         0.7

I want to add rows from df1 to df2 where the combination of (id, tag) doesn't exist in df2. So any row of df1 for which no row of df2 satisfies df1['id'] == df2['id'] and df1['tag'] == df2['tag'] should be added to df2:

>>> df2
      id   tag  Membership
0      1   700         0.5
1      2   800         0.9
2      3   600         0.8
3      3   500         0.7
4      1   800         0.9  # This row added
5      2   700         0.8  # This row added

What I've tried:

I tried to find the rows where my condition is not true, then append the result to df2:

new_rows = df1[~((df1['id'] == df2['id']) & (df1['tag'] == df2['tag']))]
# DataFrame.append was removed in pandas 2.0; pd.concat behaves the same here
df2 = pd.concat([df2, new_rows]).reset_index(drop=True)

But I'm getting the wrong result, shown below: the row with (id, tag) = (2, 800) is added even though that pair already exists in df2. Why is that?

>>> df2
      id   tag  Membership
0      1   700         0.5
1      2   800         0.9
2      3   600         0.8
3      3   500         0.7
4      1   800         0.9  # correct
5      2   700         0.8  # correct
6      2   800         0.7  # THIS SHOULDN'T BE ADDED

Solution using combine_first:

indices = ['id', 'tag']
left = df2.set_index(indices)
right = df1.set_index(indices)
combined = left.combine_first(right).reset_index()
combined

Result:

   id  tag  Membership
0   1  700         0.5
1   1  800         0.9
2   2  700         0.8
3   2  800         0.9
4   3  500         0.7
5   3  600         0.8
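
Note that the result above comes back ordered by (id, tag) rather than in df2's original row order. If the order shown in the question matters, one possible alternative (a sketch, not part of the answer above) is to stack df2 on top of df1 and drop repeated (id, tag) pairs, keeping the first occurrence. This assumes (id, tag) is unique within df2, since duplicates inside df2 would be dropped as well; the names stacked and result are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'tag': [700, 800, 700, 800],
    'Membership': [1, 0.9, 0.8, 0.7],
})
df2 = pd.DataFrame({
    'id': [1, 2, 3, 3],
    'tag': [700, 800, 600, 500],
    'Membership': [0.5, 0.9, 0.8, 0.7],
})

# Put df2 first so that keep='first' retains df2's row when a key collides.
stacked = pd.concat([df2, df1], ignore_index=True)

# Drop later rows whose (id, tag) pair has already been seen.
result = (
    stacked
    .drop_duplicates(subset=['id', 'tag'], keep='first')
    .reset_index(drop=True)
)
print(result)
```

Because df2 is listed first, its Membership values win for pairs present in both frames, and the rows taken from df1 are appended after df2's rows.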

The equality operator that you're using in this condition:

(df1['id'] == df2['id']) & (df1['tag'] == df2['tag'])

is not the right tool for this job. It doesn't work the way you expect: it compares the two columns row by row (row 0 with row 0, row 1 with row 1, and so on) rather than checking whether a value exists anywhere in the other dataframe. Let's start with a simpler case:

In [5]: df1['id'] == df2['id']
Out[5]: 
0     True
1    False
2    False
3    False
Name: id, dtype: bool

Id 1 is found on row 0 in both series, so you get True. Id 2 is present in both series, but never on the same row, so the positions never match. The same goes for the tag:

In [6]: df1['tag'] == df2['tag']
Out[6]: 
0     True
1     True
2    False
3    False
Name: tag, dtype: bool

So when you combine them with &, only the first row matches:

In [7]: (df1['id'] == df2['id']) & (df1['tag'] == df2['tag'])
Out[7]: 
0     True
1    False
2    False
3    False
dtype: bool

which is why the (id, tag) pair (2, 800) is not recognized as already present.

So instead of the equality operator, you should use merge, as suggested by the other answers.
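
If you would rather stay close to the boolean-mask style of the original attempt, a membership test on (id, tag) pairs is another option. The following is only a sketch under the assumption that building a MultiIndex from the key columns is acceptable for your data size; the names keys, pairs1, pairs2 and exists_in_df2 are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'tag': [700, 800, 700, 800],
    'Membership': [1, 0.9, 0.8, 0.7],
})
df2 = pd.DataFrame({
    'id': [1, 2, 3, 3],
    'tag': [700, 800, 600, 500],
    'Membership': [0.5, 0.9, 0.8, 0.7],
})

keys = ['id', 'tag']

# Turn the key columns of each frame into (id, tag) pairs.
pairs1 = pd.MultiIndex.from_frame(df1[keys])
pairs2 = pd.MultiIndex.from_frame(df2[keys])

# True where a df1 pair appears anywhere in df2, regardless of row position.
exists_in_df2 = pairs1.isin(pairs2)

# Keep only the df1 rows whose pair is missing from df2, then append them.
new_rows = df1[~exists_in_df2]
result = pd.concat([df2, new_rows], ignore_index=True)
print(result)
```

Unlike ==, isin asks "does this pair occur anywhere in the other frame?", which is the question the original condition was meant to answer.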

Here you go:

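# Outer merge keeps every (id, tag) pair; Membership_x comes from df2, Membership_y from df1
df3 = pd.merge(df2, df1, on=['id', 'tag'], how='outer')
# Rows where Membership_x is NaN exist only in df1
df3 = df3[df3.Membership_x.isna()][['id', 'tag', 'Membership_y']].rename(columns={'Membership_y': 'Membership'})
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
df3 = pd.concat([df2, df3], ignore_index=True)
df3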

Prints:

   id  tag  Membership
0   1  700         0.5
1   2  800         0.9
2   3  600         0.8
3   3  500         0.7
4   1  800         0.9
5   2  700         0.8
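
One caveat with the approach above: it identifies df1-only rows by looking for NaN in Membership_x, so a genuine NaN in df2's Membership column would be mistaken for a missing row. A possible variant, sketched here under the assumption that (id, tag) is unique within df2, uses merge's indicator=True flag instead; the names keys, merged and new_rows are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'tag': [700, 800, 700, 800],
    'Membership': [1, 0.9, 0.8, 0.7],
})
df2 = pd.DataFrame({
    'id': [1, 2, 3, 3],
    'tag': [700, 800, 600, 500],
    'Membership': [0.5, 0.9, 0.8, 0.7],
})

keys = ['id', 'tag']

# Left-merge df1 against only the key columns of df2; the '_merge' column
# then records whether each df1 row found a matching pair in df2.
merged = df1.merge(df2[keys], on=keys, how='left', indicator=True)

# 'left_only' marks df1 rows whose (id, tag) pair is absent from df2.
new_rows = merged.loc[merged['_merge'] == 'left_only', df1.columns]

result = pd.concat([df2, new_rows], ignore_index=True)
print(result)
```

Merging against df2[keys] rather than all of df2 avoids the _x/_y suffixes, so no renaming is needed afterwards.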
