
Pandas, efficiently converting from timespan dataframe to year-month enumerated dataframe

I currently have a data frame with an ID, a category, and a time span given by start and end dates. I would like to convert this timespan data frame into one where each row corresponds to a given YYYY MM for each ID and category.

The code below shows an example starting df and what I would typically do to create this YYYY MM enumerated data frame. There's some annoying math with the dates to ensure that I capture every YYYY MM inclusive between the start and end dates, but that's not terribly important for my question.

The issue I run into is that, in reality, I need to run this on a df that has nearly 6 million timespan entries. Is there a better way to make use of pandas than essentially doing this with a for loop? The loop runs, but it takes a few hours to crawl through the entire dataframe. It wasn't obvious to me how to accomplish this with any method other than looping.

import pandas as pd
from dateutil.relativedelta import relativedelta
from datetime import timedelta

df = pd.DataFrame({'ID': [1,1,2], 'start': ['2001-01-01', '2002-01-02', '2001-05-07'],
             'end': ['2002-01-12', '2002-01-14', '2002-05-01'], 'category': ['A', 'B', 'A']})
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])

df_list = []
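for index, row in df.iterrows():
    # Snap the start date back to the first day of its month.
    start = row.start - timedelta(days=row.start.day-1)
    # Snap the end date to the first day of the following month so the end month is included.
    stop = row.end - timedelta(days=row.end.day-1) + relativedelta(months=1)
    # Month-end dates between start and stop: one per YYYY MM in the span.
    months = pd.date_range(start, stop, freq='1M')
    tempdf = pd.DataFrame({'ID': row.ID, 'year': months.year,
                           'month': months.month,
                           'category': row.category})
    df_list.append(tempdf)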

newdf = pd.concat(df_list, ignore_index=True)

This works, largely based on the solution found here and here.

Here's the setup of the problem. I've further included a timespan that has a bad value.

import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': [1,1,2,2], 'start': ['2001-01-01', '2002-01-02', '2001-05-07', '7'],
             'end': ['2002-01-12', '2002-01-17', '2002-05-01', '2002-05-01'], 'category': ['A', 'B', 'A', 'C']})
df['start'] = pd.to_datetime(df['start'], errors='coerce')
df['end'] = pd.to_datetime(df['end'], errors='coerce')
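As a quick check of the setup above, errors='coerce' turns an entry that cannot be parsed into NaT, and notnull() then flags only the real dates. The sketch below uses a hypothetical two-element series and an obviously invalid placeholder string ('not a date') rather than the data above; the names raw, parsed, and is_real are illustrative only.

```python
import pandas as pd

# Hypothetical input: one valid date and one value that cannot be parsed.
raw = pd.Series(['2001-05-07', 'not a date'])

# errors='coerce' replaces unparseable entries with NaT instead of raising.
parsed = pd.to_datetime(raw, errors='coerce')

# Boolean mask that is True only where a real timestamp was parsed.
is_real = parsed.notnull()
print(is_real.tolist())
```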

Then the solution. With this method there's no real need to do messy algebra with the dates: resampling gets the YYYY MM enumeration, inclusive of the start and end dates.

df['temp_id'] = range(len(df))
real_span = np.logical_and(df.start.notnull(), df.end.notnull())

df_mm = (df.loc[real_span, ['temp_id', 'start', 'end']].set_index('temp_id').stack()
         .reset_index(level=-1, drop=True).rename('period').to_frame())
df_mm = df_mm.groupby('temp_id').apply(lambda x: x.set_index('period').resample('M').asfreq()).reset_index()

df_mm['month'] = df_mm['period'].dt.month
df_mm['year'] = df_mm['period'].dt.year
newdf = df_mm.merge(df, on=['temp_id']).drop(columns=['temp_id', 'period', 'end', 'start'])

The only issue this will have is if something starts and ends on the same date. In that case, the set_index(...).resample(...).asfreq() chain will throw ValueError: cannot reindex from a duplicate axis. It wouldn't be difficult to add the condition df.start != df.end to the real_span mask to filter those rows out. Or, if they should be included, the following will handle them without throwing an error. It changes the dates to the first and last days of their months, which won't affect the resampling to the YYYY MM enumerated DataFrame, but ensures that the start date never equals the end date.

from datetime import timedelta
from pandas.tseries.offsets import MonthEnd

df['start'] = (df.start[df.start.notnull()] 
    - df.start[df.start.notnull()].dt.day.apply(timedelta) 
    + timedelta(days=1))
df['end'] = (df.end[df.end.notnull()] 
    - df.end[df.end.notnull()].dt.day.apply(timedelta) 
    + timedelta(days=1)) + MonthEnd(1)
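Since the question was about avoiding per-row work on roughly 6 million spans, here is a hedged sketch of a fully vectorized alternative. It is not the resample method above: it counts the months in each span with integer arithmetic and expands the rows with np.repeat, so there is no groupby or loop. It assumes every valid span has end >= start, and it uses an obviously invalid placeholder ('not a date') for the bad value; the names valid, first, last, n_months, span_starts, and month_index are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'start': ['2001-01-01', '2002-01-02', '2001-05-07', 'not a date'],
    'end': ['2002-01-12', '2002-01-17', '2002-05-01', '2002-05-01'],
    'category': ['A', 'B', 'A', 'C'],
})
df['start'] = pd.to_datetime(df['start'], errors='coerce')
df['end'] = pd.to_datetime(df['end'], errors='coerce')

# Keep only rows where both dates parsed (assumption: end >= start for these rows).
valid = df.dropna(subset=['start', 'end'])

# Encode each date as a running month number: year * 12 + (month - 1).
first = (valid['start'].dt.year * 12 + valid['start'].dt.month - 1).to_numpy()
last = (valid['end'].dt.year * 12 + valid['end'].dt.month - 1).to_numpy()
n_months = last - first + 1  # inclusive count; 1 when start and end share a month

# Repeat each row once per month in its span.
newdf = (valid.loc[valid.index.repeat(n_months), ['ID', 'category']]
         .reset_index(drop=True))

# Offset 0..n-1 within each span, then add it to the span's first month number.
span_starts = np.cumsum(n_months) - n_months
offsets = np.arange(n_months.sum()) - np.repeat(span_starts, n_months)
month_index = np.repeat(first, n_months) + offsets

newdf['year'] = month_index // 12
newdf['month'] = month_index % 12 + 1
print(newdf)
```

Because the month count is computed per row, a span that starts and ends on the same date simply yields one row, so the duplicate-index case does not arise in this variant. Whether it is faster on the real data would need to be measured.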
