
Assign value to dataframe from another dataframe based on two conditions

I am trying to assign values from the column df2['values'] to the column df1['values']. However, values should only be assigned if:

  1. df2['category'] is equal to df1['category'] (rows are part of the same category)
  2. df1['date'] is in df2['date_range'] (date is in a certain range for a specific category)

So far I have this code, which works but is far from efficient: it takes two days to process the two dataframes (df1 has about 700k rows).

for i in df1.category.unique():
    for j in df2.category.unique():
        if i == j:  # matching categories
            for ia, ra in df1.loc[df1['category'] == i].iterrows():
                for ib, rb in df2.loc[df2['category'] == j].iterrows():
                    if df1['date'][ia] in df2['date_range'][ib]:
                        df1.loc[ia, 'values'] = rb['values']
                        break

I read that I should avoid for-loops when working with dataframes. List comprehensions are great; however, since I do not have much experience yet, I struggle to formulate more complicated code.

How can I iterate over this problem more efficiently? What key aspects should I think about when iterating over dataframes with conditions?

The code above tends to skip some rows or assign them wrongly, so I need to do a cleanup afterwards. And the biggest problem is that it is really slow.

Thank you.

Some df1 insight:

df1.head()

    date                          category
0  2015-01-07                       f2
1  2015-01-26                       f2
2  2015-01-26                       f2
3  2015-04-08                       f2
4  2015-04-10                       f2

Some df2 insight:

df2.date_range[0]

DatetimeIndex(['2011-11-02', '2011-11-03', '2011-11-04', '2011-11-05',
               '2011-11-06', '2011-11-07', '2011-11-08', '2011-11-09',
               '2011-11-10', '2011-11-11', '2011-11-12', '2011-11-13',
               '2011-11-14', '2011-11-15', '2011-11-16', '2011-11-17',
               '2011-11-18'],
              dtype='datetime64[ns]', freq='D')

The other two columns of df2:

df2[['values','category']].head()

            values             category
0            01                  f1
1            02                  f1
2           2.1                  f1
3           2.2                  f1
4            03                  f1

Edit: Corrected erroneous code and added OP input from a comment.

Alright, so if you want to join the dataframes on matching categories, you can merge them:

import numpy as np  # needed for np.where further down
import pandas as pd

df3 = df1.merge(df2, on="category")

Next, since date is a timestamp and the "date_range" is actually generated from two columns (per OP's comment), we instead use:

mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])

subset = df3.loc[mask]
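
The two steps above can be tried on a small example. This is only a sketch: the toy data is made up, and the startdate/enddate column names are taken from the snippet above rather than from the dataframes shown in the question.

```python
import pandas as pd

# Hypothetical toy data; column names follow the answer above.
df1 = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-07", "2015-01-26", "2015-04-08"]),
    "category": ["f2", "f2", "f1"],
})
df2 = pd.DataFrame({
    "category": ["f2", "f2", "f1"],
    "startdate": pd.to_datetime(["2015-01-01", "2015-01-20", "2011-11-02"]),
    "enddate": pd.to_datetime(["2015-01-10", "2015-01-31", "2011-11-18"]),
    "values": ["01", "02", "03"],
})

# Pair every df1 row with every df2 row of the same category.
df3 = df1.merge(df2, on="category")

# Keep only the pairs whose date falls inside the range (bounds inclusive).
mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])
subset = df3.loc[mask]

print(subset[["date", "category", "values"]])
```

Because this toy df1 has no values column, the merged column keeps the name values; the values_x / values_y suffixes only appear when both frames carry a column of that name. Note also that the category merge builds one row per (df1 row, df2 row) pair within each category, so the intermediate frame can get large.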

Now we get back to df1 and merge on the common dates while keeping all the rows from df1. This creates NaN in the subset columns for the rows of df1 that had no match in the earlier merge.

As such, we overwrite df1["values"] where the merged values are not NaN and leave it unchanged otherwise.

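# Left merge keeps every df1 row; matching on category as well as date
# prevents a value from another category being picked up on a shared date.
common_dates = df1.merge(subset, on=["date", "category"], how="left")

df1["values"] = np.where(common_dates["values_y"].notna(),
                         common_dates["values_y"], df1["values"])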

NB: If a row of df1 matches more than one date range, or df1 contains repeated date/category pairs, the left merge returns more rows than df1 has. You will have to drop those duplicates first, otherwise the assignment above no longer lines up with df1.
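
One possible way to sidestep the duplicate problem is to carry df1's index through the merge and assign back by index. The following is a minimal sketch on made-up data. The tie-break rule (when several ranges match, the one with the earliest startdate wins) is an assumption, chosen to loosely mirror the break in the original loop.

```python
import pandas as pd

# Hypothetical toy data: df1 has a repeated date, df2 has overlapping ranges.
df1 = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-07", "2015-01-26", "2015-01-26", "2015-04-08"]),
    "category": ["f2", "f2", "f2", "f2"],
    "values": ["unset", "unset", "unset", "unset"],
})
df2 = pd.DataFrame({
    "category": ["f2", "f2", "f2"],
    "startdate": pd.to_datetime(["2015-01-01", "2015-01-20", "2015-01-25"]),
    "enddate": pd.to_datetime(["2015-01-10", "2015-01-31", "2015-02-05"]),
    "values": ["01", "02", "2.1"],
})

# reset_index() turns df1's index into a regular column named "index",
# so every merged row remembers which df1 row it came from.
pairs = df1.reset_index().merge(df2, on="category", suffixes=("", "_df2"))

in_range = (pairs["startdate"] <= pairs["date"]) & (pairs["date"] <= pairs["enddate"])

# Assumed tie-break: keep the matching range with the earliest startdate.
matches = (
    pairs.loc[in_range]
    .sort_values(["index", "startdate"])
    .drop_duplicates(subset="index")
    .set_index("index")
)

# Assign by index label; df1 rows without a match keep their old value.
df1.loc[matches.index, "values"] = matches["values_df2"]

print(df1)
```

Since the assignment is aligned on df1's original index, repeated dates in df1 and overlapping ranges in df2 cannot shift values onto the wrong row.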

You could accomplish the first point:

1. df2['category'] is equal to the df1['category']

with the use of a join.

You could then use a for loop to filter out, inside the merged dataframe, the data points from df1[date] that are not covered by df2[date_range]. Unfortunately I need more information about the content of df1[date] and df2[date_range] to write code here that would do exactly that.
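
If each entry of df2['date_range'] is a DatetimeIndex of daily dates, as the question's output suggests, the for loop could possibly be replaced by expanding the ranges into one row per day and doing a plain join on category and date. This is a sketch on made-up data, not a tested solution for the original dataframes.

```python
import pandas as pd

# Hypothetical toy data shaped like the question's dataframes.
df1 = pd.DataFrame({
    "date": pd.to_datetime(["2011-11-05", "2011-11-20", "2011-11-05"]),
    "category": ["f1", "f1", "f2"],
})
df2 = pd.DataFrame({
    "category": ["f1", "f1"],
    "values": ["01", "02"],
    "date_range": [
        pd.date_range("2011-11-02", "2011-11-18", freq="D"),
        pd.date_range("2011-11-19", "2011-11-30", freq="D"),
    ],
})

# One row per (category, day): convert each range to a list, then explode.
daily = (
    df2.assign(date_range=df2["date_range"].map(list))
    .explode("date_range")
    .rename(columns={"date_range": "date"})
)
daily["date"] = pd.to_datetime(daily["date"])  # explode leaves an object column

# A left join keeps every df1 row; rows without a matching day get NaN.
result = df1.merge(
    daily[["category", "date", "values"]],
    on=["category", "date"],
    how="left",
)

print(result)
```

This trades memory for simplicity: the exploded frame has one row per day of every range. If ranges within a category overlap, a df1 row can match more than once and the result will contain duplicates.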
