Add a new column based on two dataframes and conditions

Question

How can I add a new column based on two dataframes and conditions? For example, if df2['x'] is within df['x'] ± 2.5 and df2['y'] is within df['y'] ± 2.5, give 1, otherwise 0.

import pandas as pd
data = {'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1 ], 'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1], 'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004]}   
df = pd.DataFrame(data)
df              

     x        y     year
0   40.1    100.1   2000
1   50.1    110.1   2000
2   60.1    120.1   2001
3   70.1    130.1   2001
4   80.1    140.1   2003
5   90.1    150.1   2003
6   0.0     160.1   2003
7   300.1   400.1   2004

df2

data2 = {'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1], 'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5], 'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004]}  
df2 = pd.DataFrame(data2)
df2

     x        y     year
0   92.2    149.3   1950
1   30.1    100.1   1951
2   82.6    139.4   1952
3   51.1    111.1   2000
4   39.4    100.8   2000
5   10.1    180.1   1954
6   0.0     0.0     1955
7   299.1   402.5   2004

Output: df

new_col = []
for i in df.index:
    # Row-by-row check: row i of df is compared only with row i of df2
    if ((df['x'].iloc[i] - 2.5) < df2['x'].iloc[i] < (df['x'].iloc[i] + 2.5) and
            (df['y'].iloc[i] - 2.5) < df2['y'].iloc[i] < (df['y'].iloc[i] + 2.5) and
            df['year'].iloc[i] == df2['year'].iloc[i]):
        out = 1
    else:
        out = 0

    if out == 1:
        new_col.append(1)
    else:
        new_col.append(0)
df['Result'] = new_col
df
            
      x       y     year   Result
0   40.1    100.1   2000    0
1   50.1    110.1   2000    0
2   60.1    120.1   2001    0
3   70.1    130.1   2001    0
4   80.1    140.1   2003    0
5   90.1    150.1   2003    0
6   0.0     160.1   2003    0
7   300.1   400.1   2004    1

But the output is not what I want. It just compares row by row. What I want to find is: does the first row in df match any row in df2 according to the conditions? That means checking all rows in df2 for each row in df. So the expected output should be as below:

Expected output: df

As you can see, 3 rows satisfy the conditions:
0 in df --> 4 in df2
1 in df --> 3 in df2
7 in df --> 7 in df2
    
So expected output:

     x        y     year   Result
0   40.1    100.1   2000    1
1   50.1    110.1   2000    1
2   60.1    120.1   2001    0
3   70.1    130.1   2001    0
4   80.1    140.1   2003    0
5   90.1    150.1   2003    0
6   0.0     160.1   2003    0
7   300.1   400.1   2004    1

Answer 1

I have found this code to work, but please comment if it does not:

import pandas as pd
data = {'x': [431228.6013, 431233.6013], 'y': [4522094.758, 4522094.758]}   
df = pd.DataFrame(data)
data2 = {'x': [431226.7421, 431280.9052], 'y': [4522093.800, 4522060.532]}  
df2 = pd.DataFrame(data2)
new_col = []
for i in df.index:
    # x is in range when the two x values differ by at most 2.5
    symbol = 'x'
    if abs(df[symbol].iloc[i] - df2[symbol].iloc[i]) <= 2.5:
        x_out = 1
    else:
        x_out = 0
    # same check for y
    symbol = 'y'
    if abs(df[symbol].iloc[i] - df2[symbol].iloc[i]) <= 2.5:
        y_out = 1
    else:
        y_out = 0

    # both coordinates must be in range for the row to count as a match
    if x_out == 1 and y_out == 1:
        new_col.append(1)
    else:
        new_col.append(0)
df['Result'] = new_col

With this I got the answers that you expected above. Also, df and df2 have to be the same length for this to work, because the rows are compared position by position.

Hope this helped!

Answer 2

One-line solution:

df['Result'] = (df - df2).abs().le(2.5).all(axis=1).astype(int)

Explanation: this relies on most operators and functions on DataFrames and Series being vectorized. That covers more than arithmetic and logical expressions: .le(), .all() / .any(), .sum() and .apply() all take an optional axis argument (here axis=1).
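To make the role of the axis argument concrete, here is a minimal sketch on a small made-up boolean frame (the name `mask` and its values are illustrative only, not taken from the question's data):

```python
import pandas as pd

# Hypothetical boolean frame: one row per record, one column per condition
mask = pd.DataFrame({'x': [True, True, False],
                     'y': [True, False, False]})

# axis=1 reduces across the columns, giving one value per row
row_all = mask.all(axis=1).tolist()   # every condition holds in the row
row_any = mask.any(axis=1).tolist()   # at least one condition holds in the row

# axis=0 reduces down the rows, giving one value per column
col_all = mask.all(axis=0).tolist()

print(row_all)  # [True, False, False]
print(row_any)  # [True, True, False]
print(col_all)  # [False, False]
```

With axis=1 the result has one entry per row, which is the shape needed for a new column.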

First, slice the two columns of interest, vector-subtract them, and compare the absolute value of the difference to 2.5 (instead of the three-way comparison -2.5 < ... < 2.5):

(df - df2)[['x','y']].abs().le(2.5)

       x      y
0  False  False
1  False  False
2  False  False
3  False  False
4  False  False
5  False  False
6   True  False
7   True   True

Now, for each row (axis=1), we need to logical-and the columns into a single boolean value, which we can then convert to int:

(df - df2)[['x','y']].abs().le(2.5).all(axis=1).astype(int)

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    1

Note:

Vectorization is faster, usually gives clearer, shorter code (it avoids all that repetitive, clunky df['x'].iloc[i]), and multiple operations/functions can be arbitrarily composed, as we do here.
In your case, you want to take the columns ['x', 'y', 'year'] all from df, then add df['Result']. So essentially everything comes from df and we are just appending one new column. We don't even need pd.concat([df, ...], axis=1); we might as well assign df['Result'] directly, and it gets appended.
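The one-liner above still compares row i of df with row i of df2, so only row 7 comes out as 1. The question asks for each row of df to be checked against every row of df2. One way this might be sketched is with NumPy broadcasting, assuming the strict ±2.5 comparison and the year-equality condition from the question's own loop. The names `tol`, `dx`, `dy`, `same_year` and `match` are illustrative, not part of either answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1],
    'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1],
    'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004],
})
df2 = pd.DataFrame({
    'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1],
    'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5],
    'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004],
})

tol = 2.5  # assumed tolerance, taken from the question

# Turning df's values into a column vector makes each comparison produce a
# (len(df), len(df2)) matrix: entry [i, j] compares df row i with df2 row j.
dx = np.abs(df['x'].to_numpy()[:, None] - df2['x'].to_numpy()[None, :])
dy = np.abs(df['y'].to_numpy()[:, None] - df2['y'].to_numpy()[None, :])
same_year = df['year'].to_numpy()[:, None] == df2['year'].to_numpy()[None, :]

# A pair matches when x and y are both within the tolerance and the years agree
match = (dx < tol) & (dy < tol) & same_year

# A df row gets 1 if it matches at least one df2 row
df['Result'] = match.any(axis=1).astype(int)
print(df['Result'].tolist())  # [1, 1, 0, 0, 0, 0, 0, 1]
```

This builds a len(df) × len(df2) matrix in memory, so it suits small and medium frames. The two frames no longer need to be the same length.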

Answer 1
0 2022-03-01 00:40:24

Answer 2
0 2022-03-03 02:02:58