
Create a list of dates from pandas cell containing a string with two dates

I have a dataframe containing strings with date ranges, looking something like this:

  winter            easter          pentecost       summer
1 01.02. - 06.02.   31.03. - 10.04. 14.05.+25.05.   07.07. - 21.08.

Now I want to generate a list of all dates that fall within those ranges. Is there a more pythonic solution than doing the following for each row:

from datetime import date, datetime

import pandas as pd


def add_years(d, years):
    """
    credits: https://stackoverflow.com/a/15743908/12934163
    """
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # e.g. 29 February moved into a non-leap year
        return d + (date(d.year + years, 1, 1) - date(d.year, 1, 1))


holidays_list = []
for col in holidays.columns:
    # '+' separates two single days; regex=False matches it literally
    if holidays[col].str.contains('+', regex=False, na=True).values[0]:
        days_list = holidays[col].values[0].split('+')
        date_strings = [s.strip() + '2010' for s in days_list]
        holidays_list.extend([datetime.strptime(s, "%d.%m.%Y").date() for s in date_strings])
    else:
        # '-' separates the start and end of a range
        days_list = holidays[col].str.split('-', n=1).tolist()
        days_list = [x.strip(' ') for x in days_list[0]]
        date_strings = [s + '2010' for s in days_list]
        date_dates = [datetime.strptime(s, "%d.%m.%Y").date() for s in date_strings]
        # an end date before the start date means the range crosses New Year
        if date_dates[0] > date_dates[1]:
            date_dates[1] = add_years(date_dates[1], 1)
        dates_between = list(pd.date_range(date_dates[0], date_dates[1], freq='D'))
        holidays_list.extend(dates_between)

and appending the values of each column to one list? As you can see, some columns contain a + instead of a -, meaning that it's not a range but rather two single days. Also, some ranges span a year boundary, for example 23.12. - 01.01

You can use regexes to identify the date patterns and extract the day and month values from them. Put this code in a function that you can apply to your dataframe columns, like below (note the pat1 and pat2 regexes, one for each of your two cases):

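import re

import pandas as pd


def parse_date_patterns(pattern):
    # raw strings with escaped dots, so '.' only matches a literal dot;
    # the trailing dot is optional (e.g. '23.12. - 01.01')
    pat1 = r'(\d+)\.(\d+)\.\s*-\s*(\d+)\.(\d+)\.?'
    pat2 = r'(\d+)\.(\d+)\.\s*\+\s*(\d+)\.(\d+)\.?'
    if '-' in pattern:
        # range: every day from start to end, both assumed to be in 2010
        day_start, month_start, day_end, month_end = re.findall(pat1, pattern)[0]
        list_dates = pd.date_range(start='2010-{m}-{d}'.format(m=month_start, d=day_start), end='2010-{m}-{d}'.format(m=month_end, d=day_end)).tolist()
    elif '+' in pattern:
        # two single days
        day_start, month_start, day_end, month_end = re.findall(pat2, pattern)[0]
        list_dates = [pd.to_datetime('2010-{m}-{d}'.format(m=month_start, d=day_start)), pd.to_datetime('2010-{m}-{d}'.format(m=month_end, d=day_end))]
    else:
        raise ValueError('unrecognised date pattern: {!r}'.format(pattern))
    return list_dates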
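
The function above puts both dates in 2010, so a range such as 23.12. - 01.01 from the question would come back empty. A minimal sketch of one way to handle that case, assuming that an end date earlier than the start date always means the range runs into the following year. The names expand_holiday and DATE_RE are hypothetical, and the year default of 2010 is taken from the example:

```python
import re
from datetime import date

import pandas as pd

# Hypothetical helper names; not part of the original answer.
# Matches 'dd.mm' with an optional trailing dot and captures (day, month).
DATE_RE = re.compile(r'(\d{1,2})\.(\d{1,2})\.?')


def expand_holiday(cell, year=2010):
    """Return all dates described by a cell such as '23.12. - 01.01'."""
    # Build one date per (day, month) match, all in the assumed year.
    found = [date(year, int(m), int(d)) for d, m in DATE_RE.findall(cell)]
    if len(found) != 2:
        raise ValueError(f"expected two dates in {cell!r}")
    start, end = found
    if '+' in cell:
        # '+' separates two single days, not a range.
        return [pd.Timestamp(start), pd.Timestamp(end)]
    if end < start:
        # Assumption: the range crosses New Year, so the end is next year.
        end = end.replace(year=year + 1)
    return pd.date_range(start, end, freq='D').tolist()
```

For example, expand_holiday('23.12. - 01.01') runs from 23 December 2010 through 1 January 2011, while expand_holiday('14.05.+25.05.') returns just the two named days.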

Then you can just apply this function to all the columns of your dataframe:

df['winter'] = df.winter.apply(parse_date_patterns)
df['easter'] = df.easter.apply(parse_date_patterns) 
df['pentecost'] = df.pentecost.apply(parse_date_patterns)
df['summer'] = df.summer.apply(parse_date_patterns)

Each cell of your dataframe will now hold the required list of dates:

>>> print(df)
                                                  winter                                             easter                                   pentecost                                             summer
    0  [2010-02-01 00:00:00, 2010-02-02 00:00:00, 201...  [2010-03-31 00:00:00, 2010-04-01 00:00:00, 201...  [2010-05-14 00:00:00, 2010-05-25 00:00:00]  [2010-07-07 00:00:00, 2010-07-08 00:00:00, 201...
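
Since the question asks for one list covering all columns, the per-cell lists can be flattened afterwards. A rough sketch, assuming every cell already holds a list of Timestamps; the small df built here is only a stand-in for the real frame, with shortened illustrative values:

```python
import pandas as pd

# Stand-in for the frame after parse_date_patterns has been applied:
# every cell holds a list of Timestamps (illustrative values only).
df = pd.DataFrame({
    'winter': [list(pd.date_range('2010-02-01', '2010-02-06'))],
    'pentecost': [[pd.Timestamp('2010-05-14'), pd.Timestamp('2010-05-25')]],
})

# Walk every column and every cell, collecting the dates into one sorted list.
holidays_list = sorted(
    ts for col in df.columns for cell in df[col] for ts in cell
)
```

The same comprehension works for any number of rows and columns, as long as each cell is a list.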


 