How can I add a new column based on two dataframes and conditions?
For example, if df2['x'] is within df['x'] ± 2.5 and df2['y'] is within df['y'] ± 2.5, give 1, otherwise 0.
import pandas as pd
data = {'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1 ], 'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1], 'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004]}
df = pd.DataFrame(data)
df
       x      y  year
0   40.1  100.1  2000
1   50.1  110.1  2000
2   60.1  120.1  2001
3   70.1  130.1  2001
4   80.1  140.1  2003
5   90.1  150.1  2003
6    0.0  160.1  2003
7  300.1  400.1  2004
df2
data2 = {'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1], 'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5], 'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004]}
df2 = pd.DataFrame(data2)
df2
       x      y  year
0   92.2  149.3  1950
1   30.1  100.1  1951
2   82.6  139.4  1952
3   51.1  111.1  2000
4   39.4  100.8  2000
5   10.1  180.1  1954
6    0.0    0.0  1955
7  299.1  402.5  2004
Output: df
new_col = []
for i in df.index:
    # Compare row i of df only with row i of df2 (same position).
    if ((df['x'].iloc[i] - 2.5) < df2['x'].iloc[i] < (df['x'].iloc[i] + 2.5) and
            (df['y'].iloc[i] - 2.5) < df2['y'].iloc[i] < (df['y'].iloc[i] + 2.5) and
            df['year'].iloc[i] == df2['year'].iloc[i]):
        out = 1
    else:
        out = 0
    new_col.append(out)
df['Result'] = new_col
df
       x      y  year  Result
0   40.1  100.1  2000       0
1   50.1  110.1  2000       0
2   60.1  120.1  2001       0
3   70.1  130.1  2001       0
4   80.1  140.1  2003       0
5   90.1  150.1  2003       0
6    0.0  160.1  2003       0
7  300.1  400.1  2004       1
But the output is not what I want: it only compares row by row. I want to know whether each row in df matches any row in df2 under these conditions, which means checking all rows in df2 for each row in df.
Expected output: df
As you can see, 3 rows satisfy the conditions:
0 in df --> 4 in df2
1 in df --> 3 in df2
7 in df --> 7 in df2
So expected output:
       x      y  year  Result
0   40.1  100.1  2000       1
1   50.1  110.1  2000       1
2   60.1  120.1  2001       0
3   70.1  130.1  2001       0
4   80.1  140.1  2003       0
5   90.1  150.1  2003       0
6    0.0  160.1  2003       0
7  300.1  400.1  2004       1
This is an alternative solution using pandas vectorization. If your dataframe is small, a for loop won't cost much in performance; for scalability and pandas best practice, however, vectorization is worth a look.
Thanks to @Timus's comment, you can first merge the two dataframes with a left join on year.
dfa = df.merge(df2, on='year', how='left', suffixes=('1', '2'))
Then, apply the conditions.
dfa['Result'] = ((dfa.x2 > dfa.x1 - 2.5) &
                 (dfa.x2 < dfa.x1 + 2.5) &
                 (dfa.y2 > dfa.y1 - 2.5) &
                 (dfa.y2 < dfa.y1 + 2.5))
Finally, group by df's x, y, year (x1, y1, year) and return True if any row's Result is True. Note that groupby sorts by the group keys by default, so the rows come back in a different order than in df.
# any() returns True if there is at least 1 True in Result per group.
dfa = dfa.groupby(['x1', 'y1', 'year']).Result.any().astype(int).reset_index()
Result
      x1     y1  year  Result
0    0.0  160.1  2003       0
1   40.1  100.1  2000       1
2   50.1  110.1  2000       1
3   60.1  120.1  2001       0
4   70.1  130.1  2001       0
5   80.1  140.1  2003       0
6   90.1  150.1  2003       0
7  300.1  400.1  2004       1
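The grouped result above is sorted by the group keys and uses the renamed columns x1 and y1, and rows of df that share the same x, y, and year would collapse into one group. If the flag should land on df itself in its original row order, one possible variation (a sketch, not part of the answer above) is to carry df's row label through the merge with reset_index() and group on that label instead. The names merged and hit are illustrative, and the absolute-difference test is assumed to be an acceptable stand-in for the strict ±2.5 window.

```python
import pandas as pd

df = pd.DataFrame({
    'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1],
    'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1],
    'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004],
})
df2 = pd.DataFrame({
    'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1],
    'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5],
    'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004],
})

# reset_index() turns df's row labels into an 'index' column,
# so each merged row remembers which df row it came from.
merged = df.reset_index().merge(df2, on='year', how='left', suffixes=('1', '2'))

# |x2 - x1| < 2.5 and |y2 - y1| < 2.5; rows with no partner in df2
# have NaN in x2/y2, and comparisons against NaN evaluate to False.
hit = ((merged.x2 - merged.x1).abs() < 2.5) & ((merged.y2 - merged.y1).abs() < 2.5)

# Collapse back to one flag per original df row; assignment aligns on the index.
df['Result'] = hit.groupby(merged['index']).any().astype(int)
print(df)
```

Because the grouping key is the original row label, duplicate (x, y, year) rows in df would each keep their own flag.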
You can loop through each DataFrame and check all combinations.
df['Result'] = 0  # default: no match, so unmatched rows are 0 rather than NaN
for index, row in df.iterrows():
    for index2, row2 in df2.iterrows():
        if (row['x'] - 2.5 < row2['x'] < row['x'] + 2.5
                and row['y'] - 2.5 < row2['y'] < row['y'] + 2.5
                and row['year'] == row2['year']):  # year check taken from the question's code
            print(index, index2)
            df.loc[index, 'Result'] = 1
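The nested loop runs len(df) × len(df2) iterations in Python. A minimal sketch of the same all-pairs check with NumPy broadcasting follows, assuming both frames are small enough that a len(df) × len(df2) boolean array fits in memory. The names dx, dy, same_year, and match are illustrative, and the year comparison is included because the question's own code requires it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': [40.1, 50.1, 60.1, 70.1, 80.1, 90.1, 0, 300.1],
    'y': [100.1, 110.1, 120.1, 130.1, 140.1, 150.1, 160.1, 400.1],
    'year': [2000, 2000, 2001, 2001, 2003, 2003, 2003, 2004],
})
df2 = pd.DataFrame({
    'x': [92.2, 30.1, 82.6, 51.1, 39.4, 10.1, 0, 299.1],
    'y': [149.3, 100.1, 139.4, 111.1, 100.8, 180.1, 0, 402.5],
    'year': [1950, 1951, 1952, 2000, 2000, 1954, 1955, 2004],
})

# Each array has shape (len(df), len(df2)):
# entry [i, j] compares row i of df with row j of df2.
dx = np.abs(df['x'].to_numpy()[:, None] - df2['x'].to_numpy()[None, :])
dy = np.abs(df['y'].to_numpy()[:, None] - df2['y'].to_numpy()[None, :])
same_year = df['year'].to_numpy()[:, None] == df2['year'].to_numpy()[None, :]

match = (dx < 2.5) & (dy < 2.5) & same_year

# A df row gets 1 if it matches at least one df2 row.
df['Result'] = match.any(axis=1).astype(int)

# (df row, df2 row) positions of every matching pair.
pairs = np.argwhere(match)
print(pairs)
print(df)
```

For large frames the full pairwise array can become too big, in which case merging on year first keeps the comparison limited to rows that share a year.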