How can I add a new column based on two dataframes and conditions?

How can I add a new column based on two dataframes and conditions? For example, if df2['x'] is within df1['x'] ± 2.5 and df2['y'] is within df1['y'] ± 2.5, give 1, otherwise 0.

import pandas as pd
data = {'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1 ], 'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1], 'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004]}   
df = pd.DataFrame(data)
df              

     x        y     year
0   40.1    100.1   2000
1   50.1    110.1   2000
2   60.1    120.1   2001
3   70.1    130.1   2001
4   80.1    140.1   2003
5   90.1    150.1   2003
6   0.0     160.1   2003
7   300.1   400.1   2004

df2

data2 = {'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1], 'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5], 'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004]}  
df2 = pd.DataFrame(data2)
df2

     x        y     year
0   92.2    149.3   1950
1   30.1    100.1   1951
2   82.6    139.4   1952
3   51.1    111.1   2000
4   39.4    100.8   2000
5   10.1    180.1   1954
6   0.0     0.0     1955
7   299.1   402.5   2004

Output: df

new_col = []
for i in df.index:
    # Compares row i of df only with row i of df2.
    if ((df['x'].iloc[i] - 2.5) < df2['x'].iloc[i] < (df['x'].iloc[i] + 2.5) and
        (df['y'].iloc[i] - 2.5) < df2['y'].iloc[i] < (df['y'].iloc[i] + 2.5) and
        df['year'].iloc[i] == df2['year'].iloc[i]):
        out = 1
    else:
        out = 0

    if out == 1:
        new_col.append(1)
    else:
        new_col.append(0)
df['Result'] = new_col
df
            
      x       y     year   Result
0   40.1    100.1   2000    0
1   50.1    110.1   2000    0
2   60.1    120.1   2001    0
3   70.1    130.1   2001    0
4   80.1    140.1   2003    0
5   90.1    150.1   2003    0
6   0.0     160.1   2003    0
7   300.1   400.1   2004    1

But this output is not what I want, because it only compares the dataframes row by row. I want to find out whether the first row of df has a match anywhere in df2 according to the conditions, and likewise for every other row. In other words, for each row in df, check all rows in df2. So the expected output should be as below:

Expected output: df

As you can see, 3 rows satisfy the conditions:
0 in df --> 4 in df2
1 in df --> 3 in df2
7 in df --> 7 in df2
    
So expected output:

     x        y     year   Result
0   40.1    100.1   2000    1
1   50.1    110.1   2000    1
2   60.1    120.1   2001    0
3   70.1    130.1   2001    0
4   80.1    140.1   2003    0
5   90.1    150.1   2003    0
6   0.0     160.1   2003    0
7   300.1   400.1   2004    1

This is an alternative solution using pandas vectorization. If your dataframe is small, a for loop won't cost much in performance. For scalability, and as a pandas best practice, it is still worth looking at vectorization.

Thanks to @Timus's comment, you can first left-merge the two dataframes on year.

dfa = df.merge(df2, on='year', how='left', suffixes=('1', '2'))

Then, apply the conditions.

dfa['Result'] = ((dfa.x2 > dfa.x1 - 2.5) & 
                (dfa.x2 < dfa.x1 + 2.5) & 
                (dfa.y2 > dfa.y1 - 2.5) & 
                (dfa.y2 < dfa.y1 + 2.5))
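
The four comparisons above amount to requiring that the absolute differences in x and y are both below 2.5. A minimal sketch of that equivalent form follows, using a small made-up pair of frames; the names `left`, `right` and `merged` are illustrative and not from the answer. The two forms can differ by floating-point rounding for values sitting exactly on the boundary.

```python
import pandas as pd

# Illustrative data only; column names follow the question.
left = pd.DataFrame({'x': [40.1, 60.1], 'y': [100.1, 120.1], 'year': [2000, 2001]})
right = pd.DataFrame({'x': [39.4, 70.0], 'y': [100.8, 120.0], 'year': [2000, 2001]})

merged = left.merge(right, on='year', how='left', suffixes=('1', '2'))

# |x2 - x1| < 2.5 expresses the same window as x1 - 2.5 < x2 < x1 + 2.5.
merged['Result'] = (((merged.x2 - merged.x1).abs() < 2.5) &
                    ((merged.y2 - merged.y1).abs() < 2.5))
print(merged)
```

Rows of the left frame with no partner for their year get NaN in x2 and y2 after the left merge, and comparisons against NaN evaluate to False, so they end up unflagged.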

Finally, group by df's x, y and year (x1, y1, year) and return True if any row's Result in the group is True.

# any() returns True if there is at least 1 True in Result per group.
dfa = dfa.groupby(['x1', 'y1', 'year']).Result.any().astype(int).reset_index()

Result

      x1     y1   year  Result
0    0.0  160.1   2003       0
1   40.1  100.1   2000       1
2   50.1  110.1   2000       1
3   60.1  120.1   2001       0
4   70.1  130.1   2001       0
5   80.1  140.1   2003       0
6   90.1  150.1   2003       0
7  300.1  400.1   2004       1
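
Note that grouping by x1, y1 and year sorts the rows and would collapse any duplicate rows of df. If the flag should land back on df in its original order, one possible variant carries df's index through the merge and groups on that instead. This is a sketch, not part of the original answer; the column name `index` is simply what `reset_index()` produces for an unnamed index.

```python
import pandas as pd

df = pd.DataFrame({'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1],
                   'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1],
                   'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004]})
df2 = pd.DataFrame({'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1],
                    'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5],
                    'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004]})

# Keep df's row labels as a column named 'index' so they survive the merge.
dfa = df.reset_index().merge(df2, on='year', how='left', suffixes=('1', '2'))

match = ((dfa.x2 > dfa.x1 - 2.5) & (dfa.x2 < dfa.x1 + 2.5) &
         (dfa.y2 > dfa.y1 - 2.5) & (dfa.y2 < dfa.y1 + 2.5))

# One flag per original row; the assignment aligns on df's index.
df['Result'] = match.groupby(dfa['index']).any().astype(int)
print(df)
```

Because the left merge keeps every row of df at least once, each original index appears in the grouped result and no row is left without a flag.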

You can loop through each DataFrame and check all combinations.

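df['Result'] = 0  # default: no match, so unmatched rows are 0 rather than NaN
for index, row in df.iterrows():
    for index2, row2 in df2.iterrows():
        # The year test mirrors the question's code and expected output.
        if ((row['x'] - 2.5 < row2['x'] < row['x'] + 2.5) and
                (row['y'] - 2.5 < row2['y'] < row['y'] + 2.5) and
                row['year'] == row2['year']):
            print(index, index2)  # matching row labels in df and df2
            df.loc[index, 'Result'] = 1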
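
For moderately sized frames, the same all-pairs check can be done without Python-level loops by comparing every row of df against every row of df2 with NumPy broadcasting. This is a swapped-in technique rather than something from the answer above, and the names `tol` and `pairs` are illustrative. It builds a len(df) × len(df2) boolean matrix, so memory grows with the product of the two lengths. The year test is included to match the expected output in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1],
                   'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1],
                   'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004]})
df2 = pd.DataFrame({'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1],
                    'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5],
                    'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004]})

tol = 2.5  # half-width of the window from the question

# Column vectors of shape (n, 1) against row vectors (1, m) broadcast to (n, m).
x1, y1, yr1 = (df[c].to_numpy()[:, None] for c in ('x', 'y', 'year'))
x2, y2, yr2 = (df2[c].to_numpy()[None, :] for c in ('x', 'y', 'year'))

# pairs[i, j] is True when row i of df matches row j of df2.
pairs = ((x2 > x1 - tol) & (x2 < x1 + tol) &
         (y2 > y1 - tol) & (y2 < y1 + tol) &
         (yr1 == yr2))

# A row of df is flagged if it matches at least one row of df2.
df['Result'] = pairs.any(axis=1).astype(int)
print(df)
```

If the matching row positions themselves are needed, `np.argwhere(pairs)` lists them as (position in df, position in df2) pairs.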
