Add a new column based on two dataframes and conditions

How can I add a new column based on two dataframes and conditions? For example, if df2['x'] is within df['x'] ± 2.5 and df2['y'] is within df['y'] ± 2.5, the result should be 1, otherwise 0.

import pandas as pd
data = {'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1 ], 'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1], 'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004]}   
df = pd.DataFrame(data)
df              

     x        y     year
0   40.1    100.1   2000
1   50.1    110.1   2000
2   60.1    120.1   2001
3   70.1    130.1   2001
4   80.1    140.1   2003
5   90.1    150.1   2003
6   0.0     160.1   2003
7   300.1   400.1   2004

df2

data2 = {'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1], 'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5], 'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004]}  
df2 = pd.DataFrame(data2)
df2

     x        y     year
0   92.2    149.3   1950
1   30.1    100.1   1951
2   82.6    139.4   1952
3   51.1    111.1   2000
4   39.4    100.8   2000
5   10.1    180.1   1954
6   0.0     0.0     1955
7   299.1   402.5   2004

Output: df

new_col = []
for i in df.index:
    # Row i of df is compared only with row i of df2
    if ((df['x'].iloc[i] - 2.5) < df2['x'].iloc[i] < (df['x'].iloc[i] + 2.5) and
            (df['y'].iloc[i] - 2.5) < df2['y'].iloc[i] < (df['y'].iloc[i] + 2.5) and
            df['year'].iloc[i] == df2['year'].iloc[i]):
        out = 1
    else:
        out = 0

    if out == 1:
        new_col.append(1)
    else:
        new_col.append(0)
df['Result'] = new_col
df
            
      x       y     year   Result
0   40.1    100.1   2000    0
1   50.1    110.1   2000    0
2   60.1    120.1   2001    0
3   70.1    130.1   2001    0
4   80.1    140.1   2003    0
5   90.1    150.1   2003    0
6   0.0     160.1   2003    0
7   300.1   400.1   2004    1

But the output is not what I want: it only compares row by row. I want to know, for each row in df, whether any row in df2 satisfies the conditions. In other words, all rows of df2 have to be checked for each row of df. So the expected output should be as below:

Expected output: df

As you can see, 3 rows satisfy the conditions:
0 in df --> 4 in df2
1 in df --> 3 in df2
7 in df --> 7 in df2
    
So expected output:

     x        y     year   Result
0   40.1    100.1   2000    1
1   50.1    110.1   2000    1
2   60.1    120.1   2001    0
3   70.1    130.1   2001    0
4   80.1    140.1   2003    0
5   90.1    150.1   2003    0
6   0.0     160.1   2003    0
7   300.1   400.1   2004    1

I have found this code to work, but please comment if it does not:

import pandas as pd
data = {'x': [431228.6013, 431233.6013], 'y': [4522094.758, 4522094.758]}   
df = pd.DataFrame(data)
data2 = {'x': [431226.7421, 431280.9052], 'y': [4522093.800, 4522060.532]}  
df2 = pd.DataFrame(data2)
new_col = []
for i in df.index:
    symbol = 'x'
    # 1 if the df2 value lies within 2.5 of the df value in the same row
    if abs(df[symbol].iloc[i] - df2[symbol].iloc[i]) <= 2.5:
        x_out = 1
    else:
        x_out = 0
    symbol = 'y'
    if abs(df[symbol].iloc[i] - df2[symbol].iloc[i]) <= 2.5:
        y_out = 1
    else:
        y_out = 0

    # Both coordinates must be within the tolerance
    if x_out == 1 and y_out == 1:
        new_col.append(1)
    else:
        new_col.append(0)
df['Result'] = new_col

With this I got the answers that you expected above. Note that df and df2 have to be the same length for this to work.
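If the two frames differ in length, one possible workaround is a cross join, which pairs every row of df with every row of df2. The following is a minimal sketch, assuming pandas 1.2 or later (for how='cross'). The third row of df2, the tol variable, and the helper names pairs and hit are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [431228.6013, 431233.6013],
                   'y': [4522094.758, 4522094.758]})
# The third row is hypothetical, added so that df2 is longer than df
df2 = pd.DataFrame({'x': [431226.7421, 431280.9052, 431235.0],
                    'y': [4522093.800, 4522060.532, 4522096.0]})

tol = 2.5  # assumed tolerance, taken from the question

# Pair every row of df with every row of df2; keep df's index as a column
pairs = df.reset_index().merge(df2, how='cross', suffixes=('', '_2'))

# A pair is a hit when both coordinates are within the tolerance
pairs['hit'] = ((pairs['x'] - pairs['x_2']).abs() <= tol) & \
               ((pairs['y'] - pairs['y_2']).abs() <= tol)

# A row of df gets 1 if at least one of its pairs is a hit
df['Result'] = pairs.groupby('index')['hit'].any().astype(int)
print(df['Result'].tolist())  # [1, 1]
```

The cross join builds len(df) * len(df2) rows, so it is best suited to small or medium frames.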

Hope this helped!

One-line solution:

df['Result'] = (df - df2)[['x','y']].abs().le(2.5).all(axis=1).astype(int)

Explanation: this relies on most operators and functions on DataFrames and Series being vectorized: not just arithmetic and logical expressions, but also DataFrame methods such as .le(), .all() / .any(), .sum() and .apply(), which all accept an optional axis argument (here axis=1).
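To make the role of axis concrete, here is a minimal sketch on a small made-up frame (the values in diff are illustrative and not taken from the question). It applies the element-wise .le() and then reduces once along rows and once along columns.

```python
import pandas as pd

# Illustrative absolute differences (hypothetical values)
diff = pd.DataFrame({'x': [0.5, 3.0, 1.0], 'y': [1.0, 0.2, 4.0]})

within = diff.le(2.5)          # element-wise: diff <= 2.5
per_row = within.all(axis=1)   # True only if every column in that row is True
per_col = within.all(axis=0)   # True only if every row in that column is True

print(per_row.tolist())  # [True, False, False]
print(per_col.tolist())  # [False, False]
```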

First, vector-subtract the two frames, slice the two columns of interest, and compare the absolute value of the difference to 2.5 (instead of the chained comparison -2.5 < ... < 2.5). Note that .le() is inclusive; .lt() would give the strict inequality used in the question:

(df - df2)[['x','y']].abs().le(2.5)

       x      y
0  False  False
1  False  False
2  False  False
3  False  False
4  False  False
5  False  False
6   True  False
7   True   True

Now, for each row (..., axis=1), we need to logical-and the columns into a single boolean value, which we can then convert to int:

(df - df2)[['x','y']].abs().le(2.5).all(axis=1).astype(int)

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    1

Note:

  • vectorization is faster and usually gives clearer, shorter code (it avoids all the repetitive, clunky df['x'].iloc[i]), and multiple operations/functions can be arbitrarily composed, as we do here.
  • in your case, you want to take the columns ['x', 'y', 'year'] all from df, then append df['Result']. So essentially everything comes from df and we are just adding one new column. We don't even need pd.concat([df, ...], axis=1); we can directly assign df['Result'], and the column gets appended.
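
Note that the one-liner compares row i of df only with row i of df2, whereas the question asks whether any row of df2 matches. A minimal sketch of such an all-pairs check with NumPy broadcasting follows. It assumes the strict inequalities and the year equality from the question's own loop; the names tol, dx, dy, same_year and match are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1],
                   'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1],
                   'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004]})
df2 = pd.DataFrame({'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1],
                    'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5],
                    'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004]})

tol = 2.5  # tolerance from the question

# Shape (len(df), len(df2)): entry [i, j] compares df row i with df2 row j
dx = np.abs(df['x'].to_numpy()[:, None] - df2['x'].to_numpy()[None, :])
dy = np.abs(df['y'].to_numpy()[:, None] - df2['y'].to_numpy()[None, :])
same_year = df['year'].to_numpy()[:, None] == df2['year'].to_numpy()[None, :]

match = (dx < tol) & (dy < tol) & same_year

# 1 if any row of df2 matches the given row of df
df['Result'] = match.any(axis=1).astype(int)
print(df['Result'].tolist())  # [1, 1, 0, 0, 0, 0, 0, 1]
```

The intermediate arrays hold len(df) * len(df2) entries, so for very large frames a chunked loop or a join on year followed by filtering may be preferable.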
