
Pandas/Numpy - Vectorize datetime calculation

tl;dr

  1. I need df.dates[iter] - df.dates[initial_fixed] for each slice of a dataframe grouped by item_id, in the fastest way possible (for the sake of learning and improving skills... and deadlines).

  2. How do I calculate business hours between these same dates, not just straight wall-clock time? Here too I need partial days (4.763 days, for example), not just an integer like .days gives.

Hi,

First, I have a dataframe df

item_id      dates               new_column   ...   other_irrelevant_columns

101          2020-09-10-08-...   FUNCTION           -neglected-
101          2020-09-18-17-...   FUNCTION           -neglected-
101          2020-10-03-11-...   FUNCTION           -neglected-

107          2017-08-dd-hh-...   FUNCTION           -neglected-
107          2017-09-dd-hh-...   FUNCTION           -neglected-

209          2019-01-dd-hh-...   FUNCTION           -neglected-
209          2019-01-dd-hh-...   FUNCTION           -neglected-
209          2019-01-dd-hh-...   FUNCTION           -neglected-
209          2019-01-dd-hh-...   FUNCTION           -neglected-

where the dates column (dtype: datetime) is chronological within each item_id, so the first row of a group is the earliest date.

I have over 400,000 rows, and I need to calculate the elapsed time for each row by taking the distance between its datetime and the origin (first) datetime of its item_id, as float days for plotting. The result is a sequence like this:

item_id      dates               [new_column        = elapsed_time]   ...   other_irrelevant_columns

101          2020-09-10-08-...   [dates[0]-dates[0] = 0       days]         -neglected-
101          2020-09-18-17-...   [dates[1]-dates[0] = 8.323   days]         -neglected-
101          2020-10-03-11-...   [dates[2]-dates[0] = 23.56   days]         -neglected-

At the moment I'm stuck with a for loop (which I had hoped was vectorized) that takes the total seconds of each timedelta and converts them to days as floats:

for id in df.item_id.unique():
    mask = df.item_id == id
    df.loc[mask, 'elapsed_days'] = (df.dates[mask] - df.dates[mask].min()).dt.total_seconds() / 86400

which is taking forever. Not in the data science spirit. What I'd like to know is a better way to perform this, whether with apply() and a lambda or something else. I also tried digitize and isin() from this guy's article, but I can't fathom how to bin the item_id values to make them work.

Second, I am also interested in a similar duration measured over business hours only (8am-6pm, no weekends or holidays in Canada), so that it captures the real time the item is active.
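
np.busday_count alone only returns whole days, so here is a minimal sketch of one way to get fractional business days with numpy's busday helpers, assuming an 8am-6pm window, Monday to Friday, and a hand-maintained HOLIDAYS list (business_days_between and _clip are illustrative helpers, not library functions):

import numpy as np
import pandas as pd

BUS_START = pd.Timedelta(hours=8)    # business day opens at 8am
BUS_END = pd.Timedelta(hours=18)     # business day closes at 6pm
DAY_HOURS = (BUS_END - BUS_START) / pd.Timedelta(hours=1)  # 10.0 hours per day
HOLIDAYS = []  # to be filled with Canadian holidays, e.g. ['2020-12-25']

def _clip(ts):
    # clamp a timestamp into its own day's 8am-6pm window
    day = ts.normalize()
    return min(max(ts, day + BUS_START), day + BUS_END)

def business_days_between(start, end):
    # fractional business days between two timestamps; assumes start <= end
    s, e = _clip(start), _clip(end)
    s_day, e_day = s.normalize(), e.normalize()
    if s_day == e_day:  # both ends fall on the same calendar day
        if np.is_busday(s_day.date(), holidays=HOLIDAYS):
            return (e - s) / pd.Timedelta(hours=1) / DAY_HOURS
        return 0.0
    hours = 0.0
    if np.is_busday(s_day.date(), holidays=HOLIDAYS):
        hours += (s_day + BUS_END - s) / pd.Timedelta(hours=1)      # rest of first day
    if np.is_busday(e_day.date(), holidays=HOLIDAYS):
        hours += (e - (e_day + BUS_START)) / pd.Timedelta(hours=1)  # start of last day
    # whole business days strictly between the two calendar dates
    full = np.busday_count((s_day + pd.Timedelta(days=1)).date(),
                           e_day.date(), holidays=HOLIDAYS)
    return (hours + full * DAY_HOURS) / DAY_HOURS

It can then be applied row-wise against each item's origin date; the per-row Python call is not vectorized, but the whole-day counting inside np.busday_count is.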

Thanks for any help.

You can use a join to do that much faster.

First, compute the per-item minimum, as your current code does:

tmp = df[['item_id', 'dates']]  # column filtering: keep only what we need
dateMin = tmp.groupby('item_id', as_index=False).min()  # earliest date per item_id

Then you can do the merge:

# Actual merge
indexed_df = df.set_index('item_id')
indexed_dateMin = dateMin.set_index('item_id')
merged = indexed_df.join(indexed_dateMin, lsuffix='_df', rsuffix='_dateMin')

# Vectorized computation; .values drops the item_id index so the result
# aligns positionally with df (the join preserves the left frame's row order,
# and .dt.total_seconds() already returns floats)
df['elapsed_days'] = ((merged['dates_df'] - merged['dates_dateMin'])
                      .dt.total_seconds() / 86400).values
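
For comparison, the same per-group minimum can also be broadcast back to the original rows in one step with groupby().transform('min'), a standard pandas idiom that avoids the explicit join:

# transform('min') returns a Series aligned with df's original index,
# so the subtraction needs no merging or re-indexing
origin = df.groupby('item_id')['dates'].transform('min')
df['elapsed_days'] = (df['dates'] - origin).dt.total_seconds() / 86400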
