Assign value to dataframe from another dataframe based on two conditions
I am trying to assign values from the column df2['values'] to the column df1['values']. However, values should only be assigned if:
1. df2['category'] is equal to df1['category']
2. df1['date'] is in df2['date_range']
So far I have this code, which works but is far from efficient: it takes me two days to process the two dfs (df1 has ca. 700k rows).
for i in df1.category.unique():
    for j in df2.category.unique():
        if i == j: # matching categories
            for ia, ra in df1.loc[df1['category'] == i].iterrows():
                for ib, rb in df2.loc[df2['category'] == j].iterrows():
                    if df1['date'][ia] in df2['date_range'][ib]:
                        df1.loc[ia, 'values'] = rb['values']
                        break
I read that I should try to avoid using for-loops when working with dataframes. List comprehensions are great; however, since I do not have a lot of experience yet, I struggle to formulate more complicated code.
How can I iterate over this problem more efficiently? What key aspects should I think about when iterating over dataframes with conditions?
The code above tends to skip some rows or assign them wrongly, so I need to do a cleanup afterwards. And the biggest problem is that it is really slow.
Thank you.
Some df1 insight:
df1.head()
        date category
0 2015-01-07       f2
1 2015-01-26       f2
2 2015-01-26       f2
3 2015-04-08       f2
4 2015-04-10       f2
Some df2 insight:
df2.date_range[0]
DatetimeIndex(['2011-11-02', '2011-11-03', '2011-11-04', '2011-11-05',
               '2011-11-06', '2011-11-07', '2011-11-08', '2011-11-09',
               '2011-11-10', '2011-11-11', '2011-11-12', '2011-11-13',
               '2011-11-14', '2011-11-15', '2011-11-16', '2011-11-17',
               '2011-11-18'],
              dtype='datetime64[ns]', freq='D')
df2 other two columns:
df2[['values','category']].head()
  values category
0     01       f1
1     02       f1
2    2.1       f1
3    2.2       f1
4     03       f1
Edit: Corrected erroneous code and added OP input from a comment
Alright, so if you want to join the dataframes on matching categories, you can merge them:
import pandas as pd
df3 = df1.merge(df2, on = "category")
Next, since date is a timestamp and the "date_range" is actually generated from two columns (startdate and enddate), per OP's comment, we can compare against those two columns directly:
mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])
subset = df3.loc[mask]
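To make the two steps above concrete, here is a minimal sketch on toy data. The column names startdate and enddate follow the answer; the rows themselves are made up for illustration, and df1 is assumed not to have a "values" column yet, so the merge adds no _x/_y suffixes.

```python
import pandas as pd

# Toy data: made-up rows, column names as used in the answer.
df1 = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-07", "2015-01-26", "2015-04-08"]),
    "category": ["f2", "f2", "f1"],
})
df2 = pd.DataFrame({
    "category": ["f2", "f2", "f1"],
    "startdate": pd.to_datetime(["2015-01-01", "2015-01-20", "2011-11-02"]),
    "enddate": pd.to_datetime(["2015-01-10", "2015-01-31", "2011-11-18"]),
    "values": ["01", "02", "03"],
})

# Step 1: pair every df1 row with every df2 row of the same category.
df3 = df1.merge(df2, on="category")

# Step 2: keep only the pairs whose date lies inside the range (bounds inclusive).
mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])
subset = df3.loc[mask]

print(subset[["date", "category", "values"]])
```

Keep in mind that the merge produces one row per matching (df1 row, df2 row) pair, so with many rows per category the intermediate df3 can get large before the mask shrinks it again.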
Now we get back to df1 and merge on the common dates while keeping all the rows from df1. This will create NaN for the subset values where they didn't match with df1 in the earlier merge.
As such, we set df1["values"] from the merged result where the entries in common are not NaN, and we leave the existing values be otherwise.
import numpy as np

common_dates = df1.merge(subset, on = "date", how= "left") # keeping df1 values
# values_y comes from df2 (assumes df1 already has a "values" column)
df1["values"] = np.where(common_dates["values_y"].notna(),
                         common_dates["values_y"], df1["values"])
NB: If more than one df1["date"] matches with the date range, you'll have to drop some values; otherwise the duplicates created by the merge mess up the result, because common_dates no longer has one row per row of df1.
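One possible way to deal with such duplicates is sketched below. It is not part of the answer above: it carries df1's row labels through the merge in a helper column (the name row_id is hypothetical), keeps only the first matching range per df1 row, which mirrors the break in the question's loop, and then assigns back by index. The toy rows are made up, and df1 is assumed to have no "values" column beforehand.

```python
import pandas as pd

# Toy data: made-up rows. Two df1 rows share a date, and that date
# falls into two overlapping df2 ranges.
df1 = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-07", "2015-01-26", "2015-01-26", "2015-04-08"]),
    "category": ["f2", "f2", "f2", "f2"],
})
df2 = pd.DataFrame({
    "category": ["f2", "f2", "f2"],
    "startdate": pd.to_datetime(["2015-01-01", "2015-01-20", "2015-01-25"]),
    "enddate": pd.to_datetime(["2015-01-10", "2015-01-31", "2015-01-27"]),
    "values": ["01", "02", "2.1"],
})

# Carry df1's index through the merge as a regular column ("row_id" is a made-up name).
pairs = df1.rename_axis("row_id").reset_index().merge(df2, on="category")

# Same inclusive range check as the mask in the answer.
in_range = pairs["date"].between(pairs["startdate"], pairs["enddate"])

# Keep the first matching range per df1 row, like the `break` in the original loop.
first_match = pairs.loc[in_range].drop_duplicates(subset="row_id", keep="first")

# Assign back by index; df1 rows without any matching range get NaN.
df1["values"] = first_match.set_index("row_id")["values"]

print(df1)
```

Because the assignment aligns on df1's index rather than on position, it does not depend on the merged frame having the same length as df1.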
You could accomplish the first point:
1. df2['category'] is equal to the df1['category']
with the use of a join.
You could then use a for loop to filter out the data points from df1[date] inside the merged dataframe that are not covered by df2[date_range]. Unfortunately, I need more information about the content of df1[date] and df2[date_range] to write the code here that would do exactly that.
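For what it's worth, the filtering step may not need a loop at all. The sketch below assumes each date range is a contiguous daily DatetimeIndex, as in the df2.date_range[0] output shown in the question, so a range is fully described by its first and last day. The list named ranges stands in for df2['date_range'], and all rows are made up for illustration.

```python
import pandas as pd

# Toy data: made-up rows.
df1 = pd.DataFrame({
    "date": pd.to_datetime(["2011-11-05", "2011-11-25", "2015-01-07"]),
    "category": ["f1", "f1", "f2"],
})
df2 = pd.DataFrame({
    "category": ["f1", "f1"],
    "values": ["01", "02"],
})

# Stand-in for df2['date_range']: one contiguous daily DatetimeIndex per df2 row.
ranges = [
    pd.date_range("2011-11-02", "2011-11-18", freq="D"),
    pd.date_range("2011-11-19", "2011-11-30", freq="D"),
]

# Reduce each range to its bounds (only valid because the ranges have no gaps).
df2["startdate"] = [r.min() for r in ranges]
df2["enddate"] = [r.max() for r in ranges]

# Join on category, then filter on the date bounds instead of looping.
merged = df1.merge(df2, on="category", how="inner")
merged = merged[merged["date"].between(merged["startdate"], merged["enddate"])]

print(merged[["date", "category", "values"]])
```

If the ranges had gaps, the bounds check would accept dates that are not actually in the range, so that assumption is worth verifying on the real data first.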