
Add date columns between 2 dates in Pandas dataframe

I have an existing dataframe which looks like:

    id  start_date  end_date
0   1   20170601    20210531
1   2   20181001    20220930
2   3   20150101    20190228
3   4   20171101    20211031

I am trying to add 85 columns to this dataframe, one per month from 20120101 to 20190101, with these values:

  • if the month/year column falls within the row's start_date to end_date range: 1
  • else: 0
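As a quick check of the column count, here is a minimal sketch that builds the month labels with `pd.date_range`. The variable names `months` and `labels` are illustrative and do not come from the question.

```python
import pandas as pd

# Month starts from January 2012 through January 2019, inclusive.
months = pd.date_range("2012-01-01", "2019-01-01", freq="MS")

# Labels in the same "mm/yy" style the question uses for its column names.
labels = months.strftime("%m/%y").tolist()

print(len(labels))
print(labels[0], labels[-1])
```

Seven full years plus January 2019 gives 7 × 12 + 1 = 85 labels, which matches the 85 columns mentioned above.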

I tried the following method:

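from collections import OrderedDict
from datetime import datetime, timedelta

import pandas as pd

# Ordered, de-duplicated "mm/yy" labels for every month from 01/12 through 01/19
start, end = [datetime.strptime(_, "%Y%m%d") for _ in ['20120101', '20190201']]
global_list = list(OrderedDict(((start + timedelta(_)).strftime(r"%m/%y"), None) for _ in range((end - start).days)).keys())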

def get_count(contract_start_date, contract_end_date):
    start, end = [datetime.strptime(_, "%Y%m%d") for _ in [contract_start_date, contract_end_date]]
    current_list = list(OrderedDict(((start + timedelta(_)).strftime(r"%m/%y"), None) for _ in range((end - start).days)).keys())
    temp_list = []
    for each in global_list:
        if each in current_list:
            temp_list.append(1)
        else:
            temp_list.append(0)
    return pd.Series(temp_list)

sample_df[global_list] = sample_df[['contract_start_date', 'contract_end_date']].apply(lambda x: get_count(*x), axis=1)

and the sample df looks like:

customer_id contract_start_date contract_end_date   01/12   02/12   03/12   04/12   05/12   06/12   07/12   ... 04/18   05/18   06/18   07/18   08/18   09/18   10/18   11/18   12/18   01/19
1   1   20181001    20220930    0   0   0   0   0   0   0   ... 0   0   0   0   0   0   1   1   1   1
9   2   20160701    20200731    0   0   0   0   0   0   0   ... 1   1   1   1   1   1   1   1   1   1
3   3   20171101    20211031    0   0   0   0   0   0   0   ... 1   1   1   1   1   1   1   1   1   1
3 rows × 88 columns

It works fine for a small dataset, but for 160k rows it had not finished even after 3 hours. Can someone tell me a better way to do this?

I am also facing problems when the dates overlap for the same customer.

First I'd cut off the out-of-range dates to normalize end_date (to ensure it falls within the time range):

In [11]: df.end_date = df.end_date.where(df.end_date < '2019-02-01', pd.Timestamp('2019-01-31')) + pd.offsets.MonthBegin()

In [12]: df
Out[12]:
   id start_date   end_date
0   1 2017-06-01 2019-02-01
1   2 2018-10-01 2019-02-01
2   3 2015-01-01 2019-02-01
3   4 2017-11-01 2019-02-01

Note: you'll need to do the same trick for start_date if there are dates prior to 2012.
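A minimal sketch of that start_date adjustment, assuming the integer yyyymmdd values from the question are first parsed with `pd.to_datetime`. The two-row frame and the name `window_start` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: the first contract starts before the 2012 window.
df = pd.DataFrame({
    "id": [1, 2],
    "start_date": [20100301, 20181001],
    "end_date": [20140531, 20220930],
})

# Parse the integer yyyymmdd values into real timestamps.
for col in ["start_date", "end_date"]:
    df[col] = pd.to_datetime(df[col].astype(str), format="%Y%m%d")

window_start = pd.Timestamp("2012-01-01")

# Keep starts inside the window; move earlier ones to the window's first month.
df["start_date"] = df["start_date"].where(df["start_date"] >= window_start, window_start)

print(df)
```

This mirrors the `where` call used for end_date, with the comparison reversed.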

I'd create the resulting DataFrame with one column per month from a date range, and then fill it in with 1 at each row's start month and -1 at the month after its (clipped) end date:

In [13]: m = pd.date_range('2012-01-01', '2019-02-01', freq='MS')

In [14]: res = pd.DataFrame(0., columns=m, index=df.index)

In [15]: res.update(pd.DataFrame(np.diag(np.ones(len(df))), df.index, df.start_date).groupby(axis=1, level=0).sum())

In [16]: res.update(-pd.DataFrame(np.diag(np.ones(len(df))), df.index, df.end_date).groupby(axis=1, level=0).sum())

The groupby sum is required if multiple rows start or end in the same month.
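To see why, here is a small sketch with made-up dates in which two rows start in the same month, so the marker frame gets two columns with the same label. It uses `.T.groupby(level=0).sum().T` rather than `groupby(axis=1, ...)`, since the `axis=1` form is deprecated as of pandas 2.1. The names `marks` and `collapsed` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical start dates: rows 0 and 1 share the same start month.
start_date = pd.to_datetime(["2017-06-01", "2017-06-01", "2018-10-01"])

# One marker per row on the diagonal; column labels are the start dates,
# so the label 2017-06-01 appears twice.
marks = pd.DataFrame(np.eye(3), index=[0, 1, 2], columns=start_date)
print(marks.columns.is_unique)

# Collapse duplicate column labels by summing them (transpose, group, transpose back).
collapsed = marks.T.groupby(level=0).sum().T
print(collapsed)
```

After collapsing, each month appears once and both rows 0 and 1 carry a 1 in the 2017-06-01 column.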

# -1 and NaN were really placeholders for zero
In [17]: res = res.replace(0, np.nan).ffill(axis=1).replace([np.nan, -1], 0)

In [18]: res
Out[18]:
   2012-01-01  2012-02-01  2012-03-01  2012-04-01  2012-05-01     ...      2018-09-01  2018-10-01  2018-11-01  2018-12-01  2019-01-01
0         0.0         0.0         0.0         0.0         0.0     ...             1.0         1.0         1.0         1.0         1.0
1         0.0         0.0         0.0         0.0         0.0     ...             0.0         1.0         1.0         1.0         1.0
2         0.0         0.0         0.0         0.0         0.0     ...             1.0         1.0         1.0         1.0         1.0
3         0.0         0.0         0.0         0.0         0.0     ...             1.0         1.0         1.0         1.0         1.0
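For comparison, here is a sketch of a different technique from the one above: instead of marking start and end and forward-filling, it compares every row's start and end month against every column month in a single NumPy broadcast. It assumes the end month itself counts as covered, and it labels the columns "mm/yy" as in the question. The helper `month_index` and the names `flags` and `out` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "start_date": [20170601, 20181001, 20150101, 20171101],
    "end_date": [20210531, 20220930, 20190228, 20211031],
})

def month_index(s):
    """Map integer yyyymmdd values to a running month number (year * 12 + month)."""
    d = pd.to_datetime(s.astype(str), format="%Y%m%d")
    return (d.dt.year * 12 + d.dt.month).to_numpy()

# Column vectors of shape (n_rows, 1).
start = month_index(df["start_date"])[:, None]
end = month_index(df["end_date"])[:, None]

# Row vector of shape (1, n_months) for January 2012 through January 2019.
months = pd.date_range("2012-01-01", "2019-01-01", freq="MS")
cols = (months.year * 12 + months.month).to_numpy()[None, :]

# Broadcasting yields an (n_rows, n_months) 0/1 matrix in one step.
flags = ((start <= cols) & (cols <= end)).astype(np.int8)

res = pd.DataFrame(flags, index=df.index, columns=months.strftime("%m/%y"))
out = pd.concat([df, res], axis=1)
print(out.iloc[:, -5:])
```

There is no per-row Python loop here, and with `int8` flags a 160k-row input needs roughly 160,000 × 85 bytes (about 14 MB) for the flag matrix.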
