
Comparing date column values in one dataframe with two date columns in another dataframe by row in Pandas

I have a dataframe like this, with two date columns and a quantity column:

     start_date       end_date          qty
1    2018-01-01      2018-01-08         23
2    2018-01-08      2018-01-15         21           
3    2018-01-15      2018-01-22         5
4    2018-01-22      2018-01-29         12

I have a second dataframe with just one column containing yearly holidays for a couple of years, like this:

         holiday
1       2018-01-01 
2       2018-01-27
3       2018-12-25
4       2018-12-26

I would like to go through the first dataframe row by row and assign a boolean value to a new column, holidays, that is True if any date in the second dataframe falls between the start and end dates of that row. The result would look like this:

     start_date       end_date          qty      holidays
1    2018-01-01      2018-01-08         23       True
2    2018-01-08      2018-01-15         21       False  
3    2018-01-15      2018-01-22         5        False
4    2018-01-22      2018-01-29         12       True

When I try to do that with a for loop, I get the following error:

ValueError: Can only compare identically-labeled Series objects

An answer would be appreciated.
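The question does not show the loop itself, so the following is only a guess at the likely cause: comparing two whole Series whose indexes differ raises exactly this error. The sketch assumes the holiday frame has a different number of rows than the first frame; the fifth holiday (2019-01-01) is a hypothetical extra row added for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "start_date": pd.to_datetime(["2018-01-01", "2018-01-08", "2018-01-15", "2018-01-22"]),
        "end_date": pd.to_datetime(["2018-01-08", "2018-01-15", "2018-01-22", "2018-01-29"]),
        "qty": [23, 21, 5, 12],
    },
    index=[1, 2, 3, 4],
)
# Assumption: the real holiday frame is longer than df1; the last row is made up.
df2 = pd.DataFrame(
    {"holiday": pd.to_datetime(
        ["2018-01-01", "2018-01-27", "2018-12-25", "2018-12-26", "2019-01-01"]
    )},
    index=[1, 2, 3, 4, 5],
)

try:
    # Series-to-Series comparison requires both sides to share the same index.
    df1["start_date"] <= df2["holiday"]
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Comparing the holiday column against a single scalar date (one row's start or end) avoids the label check, which is what the answers below do in different ways.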

If you want a fully-vectorized solution, consider using the underlying numpy arrays:

import numpy as np


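def holiday_arr(start, end, holidays):
    # Interval bounds become column vectors of shape (n, 1)
    start = start.reshape((-1, 1))
    end = end.reshape((-1, 1))
    # Holidays become a row vector of shape (1, m)
    holidays = holidays.reshape((1, -1))
    # Broadcasting gives an (n, m) boolean matrix; reduce across the holidays
    result = np.any(
        (start <= holidays) & (holidays <= end),
        axis=1
    )
    return result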

If you have your dataframes as above (calling them df1 and df2), you can obtain your desired result by running:

df1["contains_holiday"] = holiday_arr(
    df1["start_date"].to_numpy(),
    df1["end_date"].to_numpy(),
    df2["holiday"].to_numpy()
)

df1 then looks like:

  start_date   end_date  qty  contains_holiday
1 2018-01-01 2018-01-08   23              True
2 2018-01-08 2018-01-15   21             False
3 2018-01-15 2018-01-22    5             False
4 2018-01-22 2018-01-29   12              True
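To make the broadcasting step concrete, here is a minimal self-contained sketch on the sample data from the question. It is the same idea written inline, with the intermediate boolean matrix kept in a variable (`inside`, a name chosen here for illustration) so its shape can be inspected.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {
        "start_date": pd.to_datetime(["2018-01-01", "2018-01-08", "2018-01-15", "2018-01-22"]),
        "end_date": pd.to_datetime(["2018-01-08", "2018-01-15", "2018-01-22", "2018-01-29"]),
        "qty": [23, 21, 5, 12],
    },
    index=[1, 2, 3, 4],
)
df2 = pd.DataFrame(
    {"holiday": pd.to_datetime(["2018-01-01", "2018-01-27", "2018-12-25", "2018-12-26"])},
    index=[1, 2, 3, 4],
)

start = df1["start_date"].to_numpy()[:, None]    # shape (4, 1)
end = df1["end_date"].to_numpy()[:, None]        # shape (4, 1)
holidays = df2["holiday"].to_numpy()[None, :]    # shape (1, 4)

# One row per interval, one column per holiday.
inside = (start <= holidays) & (holidays <= end)
contains_holiday = inside.any(axis=1)

print(inside.shape)                # (4, 4)
print(contains_holiday.tolist())   # [True, False, False, True]
```

The intermediate matrix has one cell per (interval, holiday) pair, so memory grows with the product of the two lengths; that is fine for a few years of holidays but worth keeping in mind for very large inputs.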

Try:

def _is_holiday(row, df2):
    return ((df2['holiday'] >= row['start_date']) & (df2['holiday'] <= row['end_date'])).any()

# Assign the per-row result to the new column
df1['holidays'] = df1.apply(lambda x: _is_holiday(x, df2), axis=1)
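An equivalent way to write the per-row check is with `Series.between`, which is inclusive on both ends by default. This is only a sketch on the question's sample data; the helper name `has_holiday` and its `holiday_series` parameter are made up for the example.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "start_date": pd.to_datetime(["2018-01-01", "2018-01-08", "2018-01-15", "2018-01-22"]),
        "end_date": pd.to_datetime(["2018-01-08", "2018-01-15", "2018-01-22", "2018-01-29"]),
        "qty": [23, 21, 5, 12],
    },
    index=[1, 2, 3, 4],
)
df2 = pd.DataFrame(
    {"holiday": pd.to_datetime(["2018-01-01", "2018-01-27", "2018-12-25", "2018-12-26"])},
    index=[1, 2, 3, 4],
)


def has_holiday(row, holiday_series):
    # Hypothetical helper: True if any holiday lies in [start_date, end_date].
    return holiday_series.between(row["start_date"], row["end_date"]).any()


# Extra keyword arguments to DataFrame.apply are forwarded to the function.
df1["holidays"] = df1.apply(has_holiday, axis=1, holiday_series=df2["holiday"])
print(df1)
```

Because each row's bounds are scalars, the comparison never has to align two differently labeled Series.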

I'm not sure why you would want to go row by row; boolean comparisons on whole columns would be way faster.

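# Caveat: this compares df2 with df element by element, aligned on index labels.
# It needs identical indexes in both frames (otherwise it raises the ValueError
# from the question) and only tests holiday i against interval i.
df['holiday'] = ((df2.holiday >= df.start_date) & (df2.holiday <= df.end_date))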

Time

>>> 1000 loops, best of 3: 1.05 ms per loop

Quoting @hchw's solution (row-by-row):

def _is_holiday(row, df2):
    return ((df2['holiday'] >= row['start_date']) & (df2['holiday'] <= row['end_date'])).any()

df.apply(lambda x: _is_holiday(x, df2), axis=1)
>>> The slowest run took 4.89 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 4.46 ms per loop

Try IntervalIndex.contains with a list comprehension and np.sum:

iix = pd.IntervalIndex.from_arrays(df1.start_date, df1.end_date, closed='both')
df1['holidays'] = np.sum([iix.contains(x) for x in df2.holiday], axis=0) >= 1

Out[812]:
  start_date   end_date  qty  holidays
1 2018-01-01 2018-01-08   23      True
2 2018-01-08 2018-01-15   21     False
3 2018-01-15 2018-01-22    5     False
4 2018-01-22 2018-01-29   12      True

Note: I assume the start_date, end_date, and holiday columns are in datetime format. If they are not, you need to convert them before running the commands above, as follows:

df1.start_date = pd.to_datetime(df1.start_date)
df1.end_date = pd.to_datetime(df1.end_date)
df2.holiday = pd.to_datetime(df2.holiday)
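Putting the conversion and the IntervalIndex approach together, a minimal end-to-end sketch might look like this. It assumes the date columns start out as plain strings (for example, after reading a CSV without date parsing); the intermediate name `hits` is chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: dates arrive as strings rather than datetimes.
df1 = pd.DataFrame(
    {
        "start_date": ["2018-01-01", "2018-01-08", "2018-01-15", "2018-01-22"],
        "end_date": ["2018-01-08", "2018-01-15", "2018-01-22", "2018-01-29"],
        "qty": [23, 21, 5, 12],
    },
    index=[1, 2, 3, 4],
)
df2 = pd.DataFrame(
    {"holiday": ["2018-01-01", "2018-01-27", "2018-12-25", "2018-12-26"]},
    index=[1, 2, 3, 4],
)

# Convert the string columns to datetime64 first.
for col in ["start_date", "end_date"]:
    df1[col] = pd.to_datetime(df1[col])
df2["holiday"] = pd.to_datetime(df2["holiday"])

# One closed interval per row of df1.
iix = pd.IntervalIndex.from_arrays(df1["start_date"], df1["end_date"], closed="both")

# For each holiday, contains() gives one boolean per interval; summing counts the hits.
hits = np.sum([iix.contains(day) for day in df2["holiday"]], axis=0)
df1["holidays"] = hits >= 1

print(df1)
```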

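Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.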
