
Pandas/Numpy - Vectorize datetime calculation

tl;dr

  1. I need df.dates[iter] - df.dates[initial_fixed] for each slice of a dataframe grouped by item_id, in the fastest way possible (for the sake of learning and improving skills... and deadlines).

  2. How do I calculate business hours between these same dates, not just straight wall-clock time? Here too I need partial days (4.763 days, for example), not just an integer like .days gives.

Hi,

First, I have a dataframe df

item_id      dates               new_column   ...   other_irrelevant_columns

101          2020-09-10-08-...   FUNCTION           -neglected-
101          2020-09-18-17-...   FUNCTION           -neglected-
101          2020-10-03-11-...   FUNCTION           -neglected-

107          2017-08-dd-hh-...   FUNCTION           -neglected-
107          2017-09-dd-hh-...   FUNCTION           -neglected-

209          2019-01-dd-hh-...   FUNCTION           -neglected-
209          2019-01-dd-hh-...   FUNCTION           -neglected-
209          2019-01-dd-hh-...   FUNCTION           -neglected-
209          2019-01-dd-hh-...   FUNCTION           -neglected-

where the dates column (dtype: datetime) is chronological within each item_id, so the first row of a group is the earliest date.

I have over 400,000 rows, and I need to calculate the elapsed time for each row by taking the distance between its datetime and the origin (first) datetime of its item_id, as float days for plotting. The result is a sequence like this:

item_id      dates               [new_column        = elapsed_time]   ...   other_irrelevant_columns

101          2020-09-10-08-...   [dates[0]-dates[0] = 0       days]         -neglected-
101          2020-09-18-17-...   [dates[1]-dates[0] = 8.323   days]         -neglected-
101          2020-10-03-11-...   [dates[2]-dates[0] = 23.56   days]         -neglected-

At the moment I'm stuck with a for loop (which I had hoped was vectorized) that takes the total seconds of each timedelta and converts them to days as floats:

for id in df.item_id.unique():
    mask = df.item_id == id
    df.loc[mask, 'elapsed_days'] = (df.dates[mask] - df.dates[mask].min()).dt.total_seconds() / 86400

which is taking forever. Not in the data science spirit. What I'd like to know is a better way to perform this, whether with apply() and a lambda or something else. I also tried digitize and isin() from this guy's article, but I can't fathom how to bin the item_id values to make them work.

Second, I am also interested in a similar duration measured over business hours only (8am-6pm, no weekends or holidays in Canada), so that it captures the real time the item is active.
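
np.busday_count alone only returns whole days, so here is a minimal sketch of one way to get fractional business days with numpy's busday helpers, assuming an 8am-6pm window, Monday to Friday, and a hand-maintained HOLIDAYS list (business_days_between and _clip are illustrative helpers, not library functions):

import numpy as np
import pandas as pd

BUS_START = pd.Timedelta(hours=8)    # business day opens at 8am
BUS_END = pd.Timedelta(hours=18)     # business day closes at 6pm
DAY_HOURS = (BUS_END - BUS_START) / pd.Timedelta(hours=1)  # 10.0 hours per day
HOLIDAYS = []  # to be filled with Canadian holidays, e.g. ['2020-12-25']

def _clip(ts):
    # clamp a timestamp into its own day's 8am-6pm window
    day = ts.normalize()
    return min(max(ts, day + BUS_START), day + BUS_END)

def business_days_between(start, end):
    # fractional business days between two timestamps; assumes start <= end
    s, e = _clip(start), _clip(end)
    s_day, e_day = s.normalize(), e.normalize()
    if s_day == e_day:  # both ends fall on the same calendar day
        if np.is_busday(s_day.date(), holidays=HOLIDAYS):
            return (e - s) / pd.Timedelta(hours=1) / DAY_HOURS
        return 0.0
    hours = 0.0
    if np.is_busday(s_day.date(), holidays=HOLIDAYS):
        hours += (s_day + BUS_END - s) / pd.Timedelta(hours=1)      # rest of first day
    if np.is_busday(e_day.date(), holidays=HOLIDAYS):
        hours += (e - (e_day + BUS_START)) / pd.Timedelta(hours=1)  # start of last day
    # whole business days strictly between the two calendar dates
    full = np.busday_count((s_day + pd.Timedelta(days=1)).date(),
                           e_day.date(), holidays=HOLIDAYS)
    return (hours + full * DAY_HOURS) / DAY_HOURS

It can then be applied row-wise against each item's origin date; the per-row Python call is not vectorized, but the whole-day counting inside np.busday_count is.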

Thanks for any help.

You can use a join to do that much faster.

First, compute the per-item minimum, as your current code does:

tmp = df[['item_id', 'dates']]  # column filtering: keep only what we need
dateMin = tmp.groupby('item_id', as_index=False).min()  # earliest date per item_id

Then you can do the merge:

# Actual merge
indexed_df = df.set_index('item_id')
indexed_dateMin = dateMin.set_index('item_id')
merged = indexed_df.join(indexed_dateMin, lsuffix='_df', rsuffix='_dateMin')

# Vectorized computation; .values drops the item_id index so the result
# aligns positionally with df (the join preserves the left frame's row order,
# and .dt.total_seconds() already returns floats)
df['elapsed_days'] = ((merged['dates_df'] - merged['dates_dateMin'])
                      .dt.total_seconds() / 86400).values
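
For comparison, the same per-group minimum can also be broadcast back to the original rows in one step with groupby().transform('min'), a standard pandas idiom that avoids the explicit join:

# transform('min') returns a Series aligned with df's original index,
# so the subtraction needs no merging or re-indexing
origin = df.groupby('item_id')['dates'].transform('min')
df['elapsed_days'] = (df['dates'] - origin).dt.total_seconds() / 86400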
