tl;dr
I need df.dates[iter]-df.dates[initial_fixed]
per slice of a dataframe indexed by an item_id
in the fastest way possible (for the sake of learning and improving skills... and deadlines).
Also: how do I calculate business hours between these same dates, not just wall-clock time? And I need fractional days (4.763 days, for example), not just an integer like .days gives.
Hi,
First, I have a dataframe df
item_id dates new_column ... other_irrelevant_columns
101 2020-09-10-08-... FUNCTION -neglected-
101 2020-09-18-17-... FUNCTION -neglected-
101 2020-10-03-11-... FUNCTION -neglected-
107 2017-08-dd-hh-... FUNCTION -neglected-
107 2017-09-dd-hh-... FUNCTION -neglected-
209 2019-01-dd-hh-... FUNCTION -neglected-
209 2019-01-dd-hh-... FUNCTION -neglected-
209 2019-01-dd-hh-... FUNCTION -neglected-
209 2019-01-dd-hh-... FUNCTION -neglected-
where the dates column (dtype = datetime) is chronological per item_id, so the first row of each group holds the earliest date.
I have over 400,000 rows, and I need to calculate the elapsed time for each row as the distance between its datetime and the origin (earliest) datetime of its item_id. The result would look like:
item_id dates [new_column = elapsed_time] ... other_irrelevant_columns
101 2020-09-10-08-... [dates[0]-dates[0] = 0 days] -neglected- for plotting
101 2020-09-18-17-... [dates[1]-dates[0] = 8.323 days] -neglected-
101 2020-10-03-11-... [dates[2]-dates[0] = 23.56 days] -neglected-
At the moment I'm stuck with a for loop; the subtraction inside it is vectorized, but the loop over ids is not. It takes the total seconds of each timedelta and converts to days as a float:
for id in df.item_id.unique():
    mask = df.item_id == id
    df.loc[mask, 'elapsed_days'] = (df.loc[mask, 'dates'] - df.loc[mask, 'dates'].min()).dt.total_seconds() / 86400
which is taking forever. Not in the data science spirit. What I'd like to know is a better way to do this, whether that's apply() with a lambda or something else. I also tried np.digitize and isin() from an article I found, but I can't fathom how to bin the item_id values to make that work.
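For reference, the per-group origin can also be broadcast back to every row without any explicit loop via groupby().transform('min'). A minimal sketch on a made-up miniature frame (not the actual 400,000-row data):

```python
import pandas as pd

# Made-up miniature frame mirroring the question's layout.
df = pd.DataFrame({
    "item_id": [101, 101, 101, 107, 107],
    "dates": pd.to_datetime([
        "2020-09-10 08:00", "2020-09-18 17:00", "2020-10-03 11:00",
        "2017-08-01 09:00", "2017-09-01 09:00",
    ]),
})

# transform('min') returns a Series aligned with df, holding each group's
# earliest date on every row, so the subtraction is fully vectorized.
origin = df.groupby("item_id")["dates"].transform("min")
df["elapsed_days"] = (df["dates"] - origin).dt.total_seconds() / 86400
```

This keeps the whole computation inside pandas' C internals, so it should scale to hundreds of thousands of rows with a handful of vectorized passes rather than one pass per item_id.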
Second, I am also interested in a similar duration but over business hours only (8 am-6 pm, no weekends or Canadian holidays), so that the real time the item is active is measured.
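For the business-hours part, one possible approach (not from this thread) is NumPy's business-day routines plus manual handling of the two partial days. The `business_hours` helper, the 08:00/18:00 window and the one-entry holiday list are all illustrative assumptions; a real Canadian calendar would come from something like the `holidays` package:

```python
import numpy as np
import pandas as pd

def business_hours(start, end, holidays=None):
    """Sketch: business hours between two timestamps, counting only
    08:00-18:00 on Mon-Fri minus the given holidays. Assumes start <= end
    and that both clock times already fall inside the 08:00-18:00 window."""
    hols = [] if holidays is None else holidays
    # 10 hours of credit for every business day in [start.date(), end.date())
    hours = np.busday_count(start.date(), end.date(), holidays=hols) * 10.0
    # Drop the part of the start day that lies before `start`...
    if np.is_busday(start.date(), holidays=hols):
        hours -= (start - start.replace(hour=8, minute=0, second=0)).total_seconds() / 3600
    # ...and add the part of the end day from 08:00 up to `end`.
    if np.is_busday(end.date(), holidays=hols):
        hours += (end - end.replace(hour=8, minute=0, second=0)).total_seconds() / 3600
    return hours

# Fri 10:00 -> Thu 14:00, skipping the weekend and Labour Day (2020-09-07):
# Fri 8h + Tue 10h + Wed 10h + Thu 6h = 34 business hours.
print(business_hours(pd.Timestamp("2020-09-04 10:00"),
                     pd.Timestamp("2020-09-10 14:00"),
                     holidays=["2020-09-07"]))
```

Fractional days then follow by dividing by the 10-hour working day.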
Thanks for any help.
You can use join to do that much faster.
First you need to perform the min as you do in your current code:
tmp = df[['item_id', 'dates']]  # column filtering
dateMin = tmp.groupby('item_id', as_index=False).min()  # find the minimal date for each item_id
Then you can do the merge:
# Actual merge
indexed_df = df.set_index('item_id')
indexed_dateMin = dateMin.set_index('item_id')
merged = indexed_df.join(indexed_dateMin, lsuffix='_df', rsuffix='_dateMin')
# Vectorized computation
# .values is needed because merged is indexed by item_id, not by df's index
df['elapsed_days'] = ((merged['dates_df'] - merged['dates_dateMin']).dt.total_seconds() / 86400).values