
Pandas Boolean indexing with two dataframes

I have two pandas dataframes:

df1
'A' 'B'
 0   0
 0   2
 1   1
 1   1
 1   3

df2
'ID' 'value'
 0   62
 1   70
 2   76
 3   4674
 4   3746

I want to assign df2.value as a new column D to df1, but only where df1.A == 0. df1.B and df2.ID are supposed to be the identifiers.

Example output:

df1
'A' 'B' 'D'
 0   0   62
 0   2   76
 1   1   NaN
 1   1   NaN
 1   3   NaN

I tried the following:

df1['D'][ df1.A == 0 ] = df2['value'][df2.ID == df1.B]

However, since df2 and df1 don't have the same length, I get a ValueError.

ValueError: Series lengths must match to compare

This is almost certainly due to the boolean indexing in the last part: [df2.ID == df1.B]

Does anyone know how to solve the problem without needing to iterate over the dataframe(s)?

Thanks a bunch!

==============

Edit in reply to @EdChum: It worked perfectly with the example data, but I have issues with my real data. df1 is a huge dataset. df2 looks like this:

df2
    ID  value
0   1   1.00000
1   2   1.00000
2   3   1.00000
3   4   1.00000
4   5   1.00000
5   6   1.00000
6   7   1.00000
7   8   1.00000
8   9   0.98148
9   10  0.23330
10  11  0.56918
11  12  0.53251
12  13  0.58107
13  14  0.92405
14  15  0.00025
15  16  0.14863
16  17  0.53629
17  18  0.67130
18  19  0.53249
19  20  0.75853
20  21  0.58647
21  22  0.00156
22  23  0.00000
23  24  0.00152
24  25  1.00000

After the merge, the output is the following: first 133 times 0.98148, then 47 times 0.00025, and then it continues with more runs of values from df2 until finally a sequence of NaN entries appears...

Out[91]: df1
    A   B   D
0   1   3   0.98148
1   0   9   0.98148
2   0   9   0.98148
3   0   7   0.98148
5   1   21  0.98148
7   1   12  0.98148
...     ...     ...     ...
2592    0   2   NaN
2593    1   17  NaN
2594    1   16  NaN
2596    0   17  NaN
2597    0   6   NaN

Any idea what might have happened here? They are all int64.

==============

Here are two CSV files with data that reproduces the problem.

df1: https://owncloud.tu-berlin.de/public.php?service=files&t=2a7d244f55a5772f16aab364e78d3546

df2: https://owncloud.tu-berlin.de/public.php?service=files&t=6fa8e0c2de465cb4f8a3f8890c325eac

To reproduce:

import pandas as pd

df1 = pd.read_csv("../../df1.csv")
df2 = pd.read_csv("../../df2.csv")

df1['D'] = df1[df1.A == 0].merge(df2,left_on='B', right_on='ID', how='left')['value']

This one is slightly tricky. There are 2 steps here: first select only the rows in df where 'A' is 0, then merge the other df onto this where 'B' and 'ID' match, performing a 'left' merge. Then select the 'value' column from the result and assign it to the df. (In the code below, df is the question's df1 and df1 is the question's df2.)

In [142]:

df['D'] = df[df.A == 0].merge(df1, left_on='B',right_on='ID', how='left')['value']
df
Out[142]:
   A  B   D
0  0  0  62
1  0  2  76
2  1  1 NaN
3  1  1 NaN
4  1  3 NaN

Breaking this down shows what is happening:

In [143]:
# boolean mask on condition
df[df.A == 0]
Out[143]:
   A  B   D
0  0  0  62
1  0  2  76
In [144]:
# merge using 'B' and 'ID' columns
df[df.A == 0].merge(df1, left_on='B',right_on='ID', how='left')
Out[144]:
   A  B   D  ID  value
0  0  0  62   0     62
1  0  2  76   2     76

After all the above you can then assign directly:

df['D'] = df[df.A == 0].merge(df1, left_on='B',right_on='ID', how='left')['value']

This works because the assignment aligns with the left-hand side index, so any missing values are automatically assigned NaN.
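A caveat that is easy to miss here: a merge on columns returns a result with a fresh 0..n-1 index, and the assignment aligns on index labels. In the sample the A == 0 rows happen to sit at labels 0 and 1, so the two line up. A small sketch, using a made-up reordering of the sample rows (not data from the question), shows the labels drifting apart:

```python
import pandas as pd

# Hypothetical reordering of the sample rows so the A == 0 rows are not first.
df = pd.DataFrame({"A": [1, 0, 1, 0, 1], "B": [1, 0, 1, 2, 3]})
df1 = pd.DataFrame({"ID": [0, 1, 2, 3, 4], "value": [62, 70, 76, 4674, 3746]})

subset = df[df.A == 0]
merged = subset.merge(df1, left_on="B", right_on="ID", how="left")

print(subset.index.tolist())  # [1, 3] -> labels of the filtered rows in df
print(merged.index.tolist())  # [0, 1] -> merge result gets a new 0..n-1 index

# Assigning merged["value"] to df["D"] would therefore target labels 0 and 1,
# not the rows where A == 0 (labels 1 and 3).
```

With data like this the values would land on the wrong rows, which matches the kind of misalignment reported for the real data.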

EDIT

Another method, and one that seems to work for your real data, is to use map to perform the lookup for you. map accepts a dict or Series as a param and will look up the corresponding value. In this case you need to set the index to the 'ID' column, which reduces your df to one with just the 'value' column:

df['D'] = df[df.A==0]['B'].map(df1.set_index('ID')['value'])

So the above performs boolean indexing as before, then calls map on the 'B' column and looks up the corresponding 'value' in the other df after we set the index on 'ID'.
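As a self-contained sketch of the map approach on the sample data from the question (the variable name lookup is only for illustration):

```python
import pandas as pd

# Sample data from the question; here df is the question's df1, df1 its df2.
df = pd.DataFrame({"A": [0, 0, 1, 1, 1], "B": [0, 2, 1, 1, 3]})
df1 = pd.DataFrame({"ID": [0, 1, 2, 3, 4], "value": [62, 70, 76, 4674, 3746]})

# Series indexed by ID, used as the lookup table for map.
lookup = df1.set_index("ID")["value"]

# map keeps the index of the filtered 'B' column, so the assignment lines up
# with the original rows; rows where A != 0 are left as NaN.
df["D"] = df.loc[df.A == 0, "B"].map(lookup)
print(df)
```

Because map never rebuilds the index, this does not depend on where the A == 0 rows sit in the frame.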

Update

I looked at your data and my first method and I can see why this fails: the alignment to the left-hand side df fails, so you get 1192 values in a continuous run and then the rest of the rows are NaN up to row 2500.

What does work is applying the same mask to the left-hand side, like so:

df1.loc[df1.A==0, 'D'] = df1[df1.A == 0].merge(df2,left_on='B', right_on='ID', how='left')['value']

So this masks the rows on the left-hand side correctly and assigns the result of the merge.
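One thing that may be worth double-checking on current pandas versions: as far as I can tell, assignment through .loc also aligns a Series on its index labels, and the merge result carries a fresh 0..n-1 index. A minimal sketch that sidesteps label alignment is to assign the raw values instead, so they are placed by position. This assumes df2.ID has no duplicates, so the left merge returns exactly one row per masked row, in the same order. The shuffled sample frames below are made up for illustration:

```python
import pandas as pd

# Made-up variant of the sample data with the A == 0 rows scattered.
df1 = pd.DataFrame({"A": [1, 0, 0, 1, 0], "B": [3, 2, 0, 1, 4]})
df2 = pd.DataFrame({"ID": [0, 1, 2, 3, 4], "value": [62, 70, 76, 4674, 3746]})

mask = df1.A == 0
merged = df1[mask].merge(df2, left_on="B", right_on="ID", how="left")

# Assumption: df2.ID is unique, so merged has one row per masked row, in order.
assert len(merged) == mask.sum()

# to_numpy() drops the merge result's index, so values are placed by position.
df1.loc[mask, "D"] = merged["value"].to_numpy()
print(df1)
```

If df2.ID can contain duplicates, the merge may return more rows than the mask selects and the positional assignment no longer applies; the map approach is the simpler option in that case.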
