
Vectorize a function to get closest date from a list of dates

I can't use external libraries other than pandas or numpy.

If I had a dataframe like

```python
from datetime import datetime

import pandas as pd

d = pd.DataFrame({
    'purchase_date': [datetime(2022, 1, 5), datetime(2022, 12, 10)],
    'shop_visits': [
        [datetime(2022, 1, 4), datetime(2022, 1, 6), datetime(2022, 1, 7)],
        [datetime(2022, 5, 30), datetime(2022, 9, 30), datetime(2022, 12, 10)],
    ],
})
```
```
purchase_date                          shop_visits
   2022-01-05  [2022-01-04, 2022-01-06, 2022-01-07]
   2022-12-10  [2022-05-30, 2022-09-30, 2022-12-10]
```

I would need a function that returns 2022-01-06 for the first row and 2022-12-10 for the second, because each is the closest date in the list on or after purchase_date.

I already have a version using apply that takes the min with my own lambda key, but it's taking too long and I want to vectorize it:

```python
d.apply(
    lambda x: min(
        [v for v in x['shop_visits'] if v >= x['purchase_date']],
        key=lambda v: v - x['purchase_date'],
    ),
    axis=1,
)
```

The thing is that I am not sure if this can be done using numpy.
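For reference, here is a self-contained version of the row-wise baseline on the sample frame (this is the slow approach being replaced; the key measures the gap between each candidate visit and the purchase date):

```python
from datetime import datetime

import pandas as pd

d = pd.DataFrame({
    'purchase_date': [datetime(2022, 1, 5), datetime(2022, 12, 10)],
    'shop_visits': [
        [datetime(2022, 1, 4), datetime(2022, 1, 6), datetime(2022, 1, 7)],
        [datetime(2022, 5, 30), datetime(2022, 9, 30), datetime(2022, 12, 10)],
    ],
})

# Row-wise baseline: among the visits on or after the purchase date,
# pick the one with the smallest gap
result = d.apply(
    lambda x: min(
        [v for v in x['shop_visits'] if v >= x['purchase_date']],
        key=lambda v: v - x['purchase_date'],
    ),
    axis=1,
)
print(result)
```

Note that min raises ValueError on a row with no visit on or after the purchase date; the vectorized answers below produce NaN/NaT for such rows instead.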

Here is one way to accomplish it:

Explode shop_visits into a new frame df2, keep the rows where purchase_date is on or before the shop visit, and finally do a groupby to choose the first result:

```python
df2 = d.explode('shop_visits')
df2 = df2[df2['purchase_date'] <= df2['shop_visits']]
df2[df2.groupby('purchase_date').cumcount().eq(0)]
```
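A self-contained sketch of this approach on the sample frame (the pd.to_datetime cast is an addition, because explode leaves an object-dtype column; also note this picks the first visit in list order, so it assumes each visit list is sorted):

```python
from datetime import datetime

import pandas as pd

d = pd.DataFrame({
    'purchase_date': [datetime(2022, 1, 5), datetime(2022, 12, 10)],
    'shop_visits': [
        [datetime(2022, 1, 4), datetime(2022, 1, 6), datetime(2022, 1, 7)],
        [datetime(2022, 5, 30), datetime(2022, 9, 30), datetime(2022, 12, 10)],
    ],
})

# One row per (purchase, visit) pair; cast back to datetime64
df2 = d.explode('shop_visits')
df2['shop_visits'] = pd.to_datetime(df2['shop_visits'])
# Keep only visits on or after the purchase date
df2 = df2[df2['purchase_date'] <= df2['shop_visits']]
# Within each purchase, keep the first remaining visit
out = df2[df2.groupby('purchase_date').cumcount().eq(0)]
print(out)
```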

OR

```python
df2 = d.explode('shop_visits')
df2[df2['purchase_date'] <= df2['shop_visits']].groupby('purchase_date').first().reset_index()
```

```
  purchase_date shop_visits
0    2022-01-05  2022-01-06
1    2022-12-10  2022-12-10
```

You can try merge_asof:

```python
left = d.drop(columns=["shop_visits"]).reset_index()
right = d["shop_visits"].explode().sort_values().to_frame().reset_index()
pd.merge_asof(
    left,
    right,
    by="index",
    left_on="purchase_date",
    right_on="shop_visits",
    direction="forward",
).drop(columns=["index"])
```
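Put together with the sample frame, this might look as follows (the astype cast is an addition: merge_asof requires both keys to be sorted and of the same dtype, and explode leaves an object column; it also assumes purchase_date is already sorted, as in the sample):

```python
from datetime import datetime

import pandas as pd

d = pd.DataFrame({
    'purchase_date': [datetime(2022, 1, 5), datetime(2022, 12, 10)],
    'shop_visits': [
        [datetime(2022, 1, 4), datetime(2022, 1, 6), datetime(2022, 1, 7)],
        [datetime(2022, 5, 30), datetime(2022, 9, 30), datetime(2022, 12, 10)],
    ],
})

left = d.drop(columns=["shop_visits"]).reset_index()
right = (
    d["shop_visits"].explode().astype("datetime64[ns]")
    .sort_values().to_frame().reset_index()
)
out = pd.merge_asof(
    left,
    right,
    by="index",               # only match visits from the same row
    left_on="purchase_date",
    right_on="shop_visits",
    direction="forward",      # first visit on or after the purchase
).drop(columns=["index"])
print(out)
```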

Here's a faster version without merge_asof:

```python
tmp = (
    # Extract purchase_date and shop_visits
    d[["purchase_date", "shop_visits"]]
    # Explode shop_visits
    .explode("shop_visits")
    # Reset index so we can align using this index later
    .reset_index()
    # Since we only care about visits on or after the purchase_date,
    # filter out the rest
    .query("purchase_date <= shop_visits")
    # Sort by the original index and shop visit date
    .sort_values(["index", "shop_visits"])
    # For each index, keep only the first visit
    .drop_duplicates("index")
)

# Set the original index back on tmp so that we can
# easily align it with the original dataframe
d["first_visit_after_purchase"] = tmp.set_index("index")[["shop_visits"]]

# Note that some first_visit_after_purchase may be NaN due
# to our query condition
```

Took 7 seconds on my ancient Mac when d has 1M rows and 10 visits for each row.

The processing time grows linearly with the total number of shop visits (len(d["shop_visits"].explode())).
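A self-contained sketch of this pipeline on the small sample frame (the astype cast is an addition, since explode yields an object column and query compares dates):

```python
from datetime import datetime

import pandas as pd

d = pd.DataFrame({
    'purchase_date': [datetime(2022, 1, 5), datetime(2022, 12, 10)],
    'shop_visits': [
        [datetime(2022, 1, 4), datetime(2022, 1, 6), datetime(2022, 1, 7)],
        [datetime(2022, 5, 30), datetime(2022, 9, 30), datetime(2022, 12, 10)],
    ],
})

tmp = (
    d[["purchase_date", "shop_visits"]]
    # One row per visit; cast back to datetime64
    .explode("shop_visits")
    .astype({"shop_visits": "datetime64[ns]"})
    # Keep the original row label as a column for later alignment
    .reset_index()
    # Only visits on or after the purchase date
    .query("purchase_date <= shop_visits")
    # Earliest qualifying visit per original row
    .sort_values(["index", "shop_visits"])
    .drop_duplicates("index")
)

# Align back onto d via the original index; rows with no
# qualifying visit get NaT
d["first_visit_after_purchase"] = tmp.set_index("index")["shop_visits"]
print(d[["purchase_date", "first_visit_after_purchase"]])
```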
