How to add a new column based on two dataframes and conditions

Question

How can I add a new column based on two dataframes and conditions? For example, if df2['x'] is within df1['x'] ± 2.5 and df2['y'] is within df1['y'] ± 2.5, give 1, otherwise 0.

import pandas as pd
data = {'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1 ], 'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1], 'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004]}   
df = pd.DataFrame(data)
df              

     x        y     year
0   40.1    100.1   2000
1   50.1    110.1   2000
2   60.1    120.1   2001
3   70.1    130.1   2001
4   80.1    140.1   2003
5   90.1    150.1   2003
6   0.0     160.1   2003
7   300.1   400.1   2004

df2

data2 = {'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1], 'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5], 'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004]}  
df2 = pd.DataFrame(data2)
df2

     x        y     year
0   92.2    149.3   1950
1   30.1    100.1   1951
2   82.6    139.4   1952
3   51.1    111.1   2000
4   39.4    100.8   2000
5   10.1    180.1   1954
6   0.0     0.0     1955
7   299.1   402.5   2004

Output: df

new_col = []
for i in df.index:
    # compares row i of df only with row i of df2
    if ((df['x'].iloc[i] - 2.5) < df2['x'].iloc[i] < (df['x'].iloc[i] + 2.5) and
        (df['y'].iloc[i] - 2.5) < df2['y'].iloc[i] < (df['y'].iloc[i] + 2.5) and
        df['year'].iloc[i] == df2['year'].iloc[i]):
        out = 1
    else:
        out = 0

    if out == 1:
        new_col.append(1)
    else:
        new_col.append(0)
df['Result'] = new_col
df
            
      x       y     year   Result
0   40.1    100.1   2000    0
1   50.1    110.1   2000    0
2   60.1    120.1   2001    0
3   70.1    130.1   2001    0
4   80.1    140.1   2003    0
5   90.1    150.1   2003    0
6   0.0     160.1   2003    0
7   300.1   400.1   2004    1

But this output is not what I want: it only compares row by row. What I want to know is, for each row in df, whether any row in df2 satisfies the conditions. In other words, all rows in df2 must be checked for each row in df. So the expected output should be as below:

Expected output: df

As you can see, 3 rows satisfy the conditions:
0 in df --> 4 in df2
1 in df --> 3 in df2
7 in df --> 7 in df2
    
So expected output:

     x        y     year   Result
0   40.1    100.1   2000    1
1   50.1    110.1   2000    1
2   60.1    120.1   2001    0
3   70.1    130.1   2001    0
4   80.1    140.1   2003    0
5   90.1    150.1   2003    0
6   0.0     160.1   2003    0
7   300.1   400.1   2004    1

Answer 1

This is an alternative solution using Pandas vectorization. If your dataframe is small, a for loop won't cost much performance; however, for scalability and as a matter of Pandas best practice, it is worth taking a look at vectorization in Pandas.

Thanks to @Timus's comment, you can first left-merge the two dataframes on year.

dfa = df.merge(df2, on='year', how='left', suffixes=('1', '2'))

Then, apply the conditions.

dfa['Result'] = ((dfa.x2 > dfa.x1 - 2.5) & 
                (dfa.x2 < dfa.x1 + 2.5) & 
                (dfa.y2 > dfa.y1 - 2.5) & 
                (dfa.y2 < dfa.y1 + 2.5))
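
The four comparisons above could also be expressed with `Series.between`. What follows is only a sketch, assuming pandas 1.3 or later (where `inclusive` accepts the string 'neither'); the names `in_x`, `in_y`, `result_between` and `result_explicit` are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1],
    'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1],
    'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004],
})
df2 = pd.DataFrame({
    'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1],
    'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5],
    'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004],
})

dfa = df.merge(df2, on='year', how='left', suffixes=('1', '2'))

# inclusive='neither' keeps both bounds strict, like the > and < comparisons.
in_x = dfa['x2'].between(dfa['x1'] - 2.5, dfa['x1'] + 2.5, inclusive='neither')
in_y = dfa['y2'].between(dfa['y1'] - 2.5, dfa['y1'] + 2.5, inclusive='neither')
result_between = in_x & in_y

# The explicit comparisons from the answer, kept for cross-checking.
result_explicit = ((dfa.x2 > dfa.x1 - 2.5) & (dfa.x2 < dfa.x1 + 2.5) &
                   (dfa.y2 > dfa.y1 - 2.5) & (dfa.y2 < dfa.y1 + 2.5))
```

Rows of df whose year has no counterpart in df2 get NaN for x2 and y2 after the left merge, and both forms evaluate to False for them.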

Finally, group by df's x, y, year (x1, y1, year) and return True if any row's Result in the group is True.

# any() returns True if there is at least 1 True in Result per group.
dfa = dfa.groupby(['x1', 'y1', 'year']).Result.any().astype(int).reset_index()

Result

      x1     y1   year  Result
0    0.0  160.1   2003       0
1   40.1  100.1   2000       1
2   50.1  110.1   2000       1
3   60.1  120.1   2001       0
4   70.1  130.1   2001       0
5   80.1  140.1   2003       0
6   90.1  150.1   2003       0
7  300.1  400.1   2004       1
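
Note that the groupby sorts by the group keys, so the rows above are no longer in df's original order, and duplicate (x, y, year) rows in df would collapse into a single group. One possible way to write the flag back onto df row by row is to group on a per-row tag instead. This is a sketch under those assumptions; `flag_matches`, `tol` and the `_row` column are hypothetical names introduced here, not part of the answer.

```python
import pandas as pd

def flag_matches(df, df2, tol=2.5):
    """Return a 0/1 Series aligned with df's index (hypothetical helper)."""
    # Tag each df row with its position so it can be identified after the merge.
    left = df[['x', 'y', 'year']].assign(_row=range(len(df)))
    merged = left.merge(df2[['x', 'y', 'year']], on='year',
                        how='left', suffixes=('1', '2'))
    hit = ((merged['x2'] > merged['x1'] - tol) & (merged['x2'] < merged['x1'] + tol) &
           (merged['y2'] > merged['y1'] - tol) & (merged['y2'] < merged['y1'] + tol))
    # One flag per original row: 1 if any same-year candidate in df2 matched.
    flags = hit.groupby(merged['_row']).any().astype(int)
    return pd.Series(flags.to_numpy(), index=df.index, name='Result')

df = pd.DataFrame({
    'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1],
    'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1],
    'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004],
})
df2 = pd.DataFrame({
    'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1],
    'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5],
    'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004],
})

df['Result'] = flag_matches(df, df2)
print(df)
```

With the data from the question, this should leave df in its original order with Result set to 1 for rows 0, 1 and 7.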

Answer 2

You can loop through each DataFrame and check all combinations.

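df['Result'] = 0  # default: no match found
for index, row in df.iterrows():
    for index2, row2 in df2.iterrows():
        # the year check mirrors the condition in the question's own loop
        if ((row['x'] - 2.5 < row2['x'] < row['x'] + 2.5) and
                (row['y'] - 2.5 < row2['y'] < row['y'] + 2.5) and
                row['year'] == row2['year']):
            print(index, index2)
            df.loc[index, 'Result'] = 1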
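
The same all-pairs check can be sketched without Python-level loops by using NumPy broadcasting. This is an illustrative variant, not the answerer's code; it assumes the year must also match (as in the question's loop), and `tol`, `match` and `pairs` are names introduced here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1],
    'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1],
    'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004],
})
df2 = pd.DataFrame({
    'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1],
    'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5],
    'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004],
})

tol = 2.5  # half-width of the window from the question

# df values as column vectors (n x 1), df2 values as flat arrays (m,):
# every comparison broadcasts to an n x m matrix covering all pairs.
x1 = df['x'].to_numpy()[:, None]
y1 = df['y'].to_numpy()[:, None]
yr1 = df['year'].to_numpy()[:, None]
x2 = df2['x'].to_numpy()
y2 = df2['y'].to_numpy()
yr2 = df2['year'].to_numpy()

match = ((x2 > x1 - tol) & (x2 < x1 + tol) &
         (y2 > y1 - tol) & (y2 < y1 + tol) &
         (yr1 == yr2))

# A df row is flagged if it matches at least one df2 row.
df['Result'] = match.any(axis=1).astype(int)

# (df position, df2 position) of each matching pair.
pairs = np.argwhere(match)
print(pairs.tolist())  # [[0, 4], [1, 3], [7, 7]]
```

The boolean matrix has len(df) × len(df2) entries, so this trades memory for speed; for very large frames the merge-based approach in Answer 1 may be the better fit.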

2 solutions

Solution 1
1 accepted 2022-03-02 20:57:51

Solution 2
0 2022-03-02 20:34:22
