Comparing date column values in one dataframe with two date columns in another dataframe by row in Pandas
I have a dataframe like this, with two date columns and a quantity column:
   start_date    end_date  qty
1  2018-01-01  2018-01-08   23
2  2018-01-08  2018-01-15   21
3  2018-01-15  2018-01-22    5
4  2018-01-22  2018-01-29   12
I have a second dataframe with a single column containing the yearly holidays for a couple of years, like this:
      holiday
1  2018-01-01
2  2018-01-27
3  2018-12-25
4  2018-12-26
I would like to go through the first dataframe row by row and assign a boolean value to a new column, holidays, that is True if any date in the second dataframe falls between the start_date and end_date of that row (inclusive). The result would look like this:
   start_date    end_date  qty  holidays
1  2018-01-01  2018-01-08   23      True
2  2018-01-08  2018-01-15   21     False
3  2018-01-15  2018-01-22    5     False
4  2018-01-22  2018-01-29   12      True
When I try to do that with a for loop, I get the following error:
ValueError: Can only compare identically-labeled Series objects
An answer would be appreciated.
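For context, one common way to hit this error is to compare two Series whose indexes differ. The sketch below is only an assumption about the failing loop, which the question does not show. It uses the sample data with a deliberately shortened holiday list, and the names `message`, `first`, and `in_first_row` are illustrative.

```python
import pandas as pd

# Sample frames modeled on the question; df2 is deliberately shorter than df1.
df1 = pd.DataFrame(
    {
        "start_date": pd.to_datetime(
            ["2018-01-01", "2018-01-08", "2018-01-15", "2018-01-22"]
        ),
        "end_date": pd.to_datetime(
            ["2018-01-08", "2018-01-15", "2018-01-22", "2018-01-29"]
        ),
    },
    index=[1, 2, 3, 4],
)
df2 = pd.DataFrame(
    {"holiday": pd.to_datetime(["2018-01-01", "2018-01-27", "2018-12-25"])},
    index=[1, 2, 3],
)

# A Series-to-Series comparison requires both sides to carry the same labels.
try:
    df2["holiday"] >= df1["start_date"]
    message = None
except ValueError as exc:
    message = str(exc)

# Comparing the whole holiday column against one row's scalar dates works,
# because a scalar is broadcast against every holiday.
first = df1.iloc[0]
in_first_row = (
    (df2["holiday"] >= first["start_date"]) & (df2["holiday"] <= first["end_date"])
).any()
```

If the loop compared whole columns rather than one row's scalar dates, this would explain the message. The answers below avoid the problem in different ways.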
If you want a fully vectorized solution, consider using the underlying numpy arrays:
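import numpy as np

def holiday_arr(start, end, holidays):
    # Column vectors of shape (n_rows, 1) and a row vector of shape (1, n_holidays)
    start = start.reshape((-1, 1))
    end = end.reshape((-1, 1))
    holidays = holidays.reshape((1, -1))
    # Broadcasting gives an (n_rows, n_holidays) mask; reduce it across holidays
    result = np.any(
        (start <= holidays) & (holidays <= end),
        axis=1
    )
    return result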
If you have your dataframes as above (calling them df1 and df2), you can obtain your desired result by running:
df1["contains_holiday"] = holiday_arr(
    df1["start_date"].to_numpy(),
    df1["end_date"].to_numpy(),
    df2["holiday"].to_numpy()
)
df1 then looks like:
   start_date    end_date  qty  contains_holiday
1  2018-01-01  2018-01-08   23              True
2  2018-01-08  2018-01-15   21             False
3  2018-01-15  2018-01-22    5             False
4  2018-01-22  2018-01-29   12              True
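To make the broadcasting step concrete, here is a minimal self-contained sketch of the same idea on the sample data from the question. The intermediate name `mask` is illustrative, and the frames are rebuilt by hand on the assumption that all three columns are already datetime64.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with datetime64 columns.
df1 = pd.DataFrame(
    {
        "start_date": pd.to_datetime(
            ["2018-01-01", "2018-01-08", "2018-01-15", "2018-01-22"]
        ),
        "end_date": pd.to_datetime(
            ["2018-01-08", "2018-01-15", "2018-01-22", "2018-01-29"]
        ),
        "qty": [23, 21, 5, 12],
    },
    index=[1, 2, 3, 4],
)
df2 = pd.DataFrame(
    {
        "holiday": pd.to_datetime(
            ["2018-01-01", "2018-01-27", "2018-12-25", "2018-12-26"]
        )
    },
    index=[1, 2, 3, 4],
)

# Shape (n_rows, 1) against shape (1, n_holidays) broadcasts to (n_rows, n_holidays).
start = df1["start_date"].to_numpy()[:, None]
end = df1["end_date"].to_numpy()[:, None]
holidays = df2["holiday"].to_numpy()[None, :]

# mask[i, j] is True when holiday j lies inside the closed range of row i.
mask = (start <= holidays) & (holidays <= end)
df1["contains_holiday"] = mask.any(axis=1)
```

The intermediate mask has one entry per (row, holiday) pair, so memory grows with the product of the two lengths. For very large frames, processing df1 in chunks may be preferable.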
Try:
def _is_holiday(row, df2):
    return ((df2['holiday'] >= row['start_date']) & (df2['holiday'] <= row['end_date'])).any()

df1.apply(lambda x: _is_holiday(x, df2), axis=1)
I'm not sure why you would want to go row by row. Boolean comparisons would be way faster.
df['holiday'] = ((df2.holiday >= df.start_date) & (df2.holiday <= df.end_date))
Time:
>>> 1000 loops, best of 3: 1.05 ms per loop
Quoting @hchw's solution (row by row):
def _is_holiday(row, df2):
    return ((df2['holiday'] >= row['start_date']) & (df2['holiday'] <= row['end_date'])).any()

df.apply(lambda x: _is_holiday(x, df2), axis=1)
>>> The slowest run took 4.89 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 4.46 ms per loop
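One thing worth checking before comparing these timings is whether the two approaches compute the same thing. Comparing two Series directly aligns them by index label, so row i is tested only against holiday i, not against every holiday. The sketch below illustrates this on the sample data, assuming both frames share the index 1..4 as shown in the question. The names `elementwise` and `rowwise` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "start_date": pd.to_datetime(
            ["2018-01-01", "2018-01-08", "2018-01-15", "2018-01-22"]
        ),
        "end_date": pd.to_datetime(
            ["2018-01-08", "2018-01-15", "2018-01-22", "2018-01-29"]
        ),
    },
    index=[1, 2, 3, 4],
)
df2 = pd.DataFrame(
    {
        "holiday": pd.to_datetime(
            ["2018-01-01", "2018-01-27", "2018-12-25", "2018-12-26"]
        )
    },
    index=[1, 2, 3, 4],
)

# Label-aligned comparison: holiday i is checked against the range in row i only.
elementwise = (df2["holiday"] >= df["start_date"]) & (df2["holiday"] <= df["end_date"])

# Row-by-row check: every holiday is checked against each row's range.
rowwise = df.apply(
    lambda row: (
        (df2["holiday"] >= row["start_date"]) & (df2["holiday"] <= row["end_date"])
    ).any(),
    axis=1,
)
```

On this data the two results differ in the last row: the holiday 2018-01-27 falls inside the range of row 4 but sits in row 2 of df2, so only the row-by-row version flags it.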
Try IntervalIndex.contains with a list comprehension and np.sum:
import numpy as np
import pandas as pd

iix = pd.IntervalIndex.from_arrays(df1.start_date, df1.end_date, closed='both')
df1['holidays'] = np.sum([iix.contains(x) for x in df2.holiday], axis=0) >= 1
Out[812]:
   start_date    end_date  qty  holidays
1  2018-01-01  2018-01-08   23      True
2  2018-01-08  2018-01-15   21     False
3  2018-01-15  2018-01-22    5     False
4  2018-01-22  2018-01-29   12      True
Note: I assume the start_date, end_date, and holiday columns are in datetime format. If they are not, you need to convert them before running the command above, as follows:
df1.start_date = pd.to_datetime(df1.start_date)
df1.end_date = pd.to_datetime(df1.end_date)
df2.holiday = pd.to_datetime(df2.holiday)
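Putting the conversion and the interval check together, a minimal end-to-end sketch might look like the following. It assumes the dates arrive as ISO-formatted strings, which the question does not state, and the name `hits` is illustrative. It uses `any` instead of `np.sum(...) >= 1`, which should give the same result here.

```python
import numpy as np
import pandas as pd

# Sample data with the dates stored as plain strings (an assumption for illustration).
df1 = pd.DataFrame(
    {
        "start_date": ["2018-01-01", "2018-01-08", "2018-01-15", "2018-01-22"],
        "end_date": ["2018-01-08", "2018-01-15", "2018-01-22", "2018-01-29"],
        "qty": [23, 21, 5, 12],
    },
    index=[1, 2, 3, 4],
)
df2 = pd.DataFrame(
    {"holiday": ["2018-01-01", "2018-01-27", "2018-12-25", "2018-12-26"]},
    index=[1, 2, 3, 4],
)

# Convert the string columns to datetime64 before building intervals.
for col in ("start_date", "end_date"):
    df1[col] = pd.to_datetime(df1[col])
df2["holiday"] = pd.to_datetime(df2["holiday"])

# One closed interval per row of df1.
iix = pd.IntervalIndex.from_arrays(df1["start_date"], df1["end_date"], closed="both")

# hits[j, i] is True when holiday j falls inside interval i.
hits = np.array([iix.contains(day) for day in df2["holiday"]])
df1["holidays"] = hits.any(axis=0)
```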