Filter data where date is within +/-30 days of multiple given dates

I have a dataset where each observation has a Date. I also have a list of events. I want to filter the dataset and keep an observation only if its date is within +/- 30 days of an event. I also want to know which event it is closest to.

For example, the main dataset looks like:

Product Date
Chicken 2008-09-08
Pork    2008-08-22
Beef    2008-08-15
Rice    2008-07-22
Coke    2008-04-05
Cereal  2008-04-03
Apple   2008-04-02
Banana  2008-04-01

It is generated by

import pandas as pd

d = {'Product': ['Apple', 'Banana', 'Cereal', 'Coke', 'Rice', 'Beef', 'Pork', 'Chicken'],
     'Date': ['2008-04-02', '2008-04-01', '2008-04-03', '2008-04-05',
              '2008-07-22', '2008-08-15', '2008-08-22', '2008-09-08']}

df = pd.DataFrame(data = d)

df['Date'] = pd.to_datetime(df['Date'])

Then I have a column of events:

Date
2008-05-03
2008-07-20
2008-09-01

generated by

event = pd.DataFrame({'Date': pd.to_datetime(['2008-05-03', '2008-07-20', '2008-09-01'])})

GOAL (EDITED)

I want to keep the rows in df only if df['Date'] is within a month of some event['Date']. For example, the first event occurred on 2008-05-03, so I want to keep observations between 2008-04-03 and 2008-06-03, and also create a new column recording that these observations are closest to the event on 2008-05-03.

Product Date        Event
Chicken 2008-09-08  2008-09-01
Pork    2008-08-22  2008-09-01
Beef    2008-08-15  2008-07-20
Rice    2008-07-22  2008-07-20
Coke    2008-04-05  2008-05-03
Cereal  2008-04-03  2008-05-03

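Use NumPy broadcasting, assuming "within a month" means within 30 days:

import numpy as np

# |observation - event| in days for every pair; keep rows within 30 days of any event
df[np.any(np.abs(df.Date.values[:,None]-event.Date.values)/np.timedelta64(1,'D') <= 30, axis=1)]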
Out[90]: 
   Product       Date
0  Chicken 2008-09-08
1     Pork 2008-08-22
2     Beef 2008-08-15
3     Rice 2008-07-22
4     Coke 2008-04-05
5   Cereal 2008-04-03
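
The broadcast mask only filters; the same distance matrix can also tell which event each observation is closest to. A minimal sketch, assuming the df and event frames from the question and a 30-day cutoff (the names dist and nearest and the Event column are illustrative, not part of the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Cereal', 'Coke', 'Rice', 'Beef', 'Pork', 'Chicken'],
    'Date': pd.to_datetime(['2008-04-02', '2008-04-01', '2008-04-03', '2008-04-05',
                            '2008-07-22', '2008-08-15', '2008-08-22', '2008-09-08']),
})
event = pd.DataFrame({'Date': pd.to_datetime(['2008-05-03', '2008-07-20', '2008-09-01'])})

# Absolute distance in days between every observation (rows) and every event (columns).
dist = np.abs(df['Date'].values[:, None] - event['Date'].values) / np.timedelta64(1, 'D')

# Column position of the closest event for each observation.
nearest = dist.argmin(axis=1)

# Attach the closest event, then keep rows whose closest event is at most 30 days away.
result = df.assign(Event=event['Date'].values[nearest])
result = result[dist.min(axis=1) <= 30]
print(result)
```

By day count, Beef (2008-08-15) is 17 days from 2008-09-01 and 26 days from 2008-07-20, so this sketch assigns it to 2008-09-01, which differs from the table in the question.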
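
Alternatively, merge_asof with direction='nearest' attaches the closest event to each row, which can then be filtered:

event['eDate'] = event.Date
df = pd.merge_asof(df.sort_values('Date'), event.sort_values('Date'), on="Date", direction='nearest')
df[(df.Date - df.eDate).abs() <= pd.Timedelta(days=30)]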

I would use a list comprehension with an IntervalIndex

# Calendar-month offset (pd.offsets.MonthOffset is unavailable in recent pandas)
ms = pd.DateOffset(months=1)
e1 = event.Date - ms
e2 = event.Date + ms
iix = pd.IntervalIndex.from_arrays(e1, e2, closed='both')
df.loc[[any(d in i for i in iix) for d in df.Date]]

Out[93]:
   Product       Date
2   Cereal 2008-04-03
3     Coke 2008-04-05
4     Rice 2008-07-22
5     Beef 2008-08-15
6     Pork 2008-08-22
7  Chicken 2008-09-08
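
A calendar-month window is not the same as a 30-day window, so this approach and the 30-day ones can disagree at the edges. A small sketch using the first event from the question (ev, lo and hi are illustrative names):

```python
import pandas as pd

ev = pd.Timestamp('2008-05-03')
month = pd.DateOffset(months=1)

# Calendar-month bounds keep the day of the month and shift the month.
lo, hi = ev - month, ev + month
print(lo.date(), hi.date())            # 2008-04-03 2008-06-03

# The same bounds measured in days: April has 30 days, May has 31.
print((ev - lo).days, (hi - ev).days)  # 30 31

# A 30-day window ends one day earlier on the upper side.
print((ev + pd.Timedelta(days=30)).date())  # 2008-06-02
```

An observation dated 2008-06-03 would therefore pass the calendar-month filter but fail a strict 30-day one.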

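If only the month matters, irrespective of the day, this may be useful.

rng = []
# For each event, allow the previous, same, and next month number
# (the year is ignored and there is no December/January wrap-around)
for a, b in zip(event['Date'].dt.month - 1, event['Date'].dt.month + 1):
    rng = rng + list(range(a, b + 1))
df[df['Date'].dt.month.isin(set(rng))]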
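
Comparing bare month numbers ignores the year and does not wrap around December/January. If that matters, one option is to compare a running month count instead. A sketch under that assumption, using the frames from the question (month_id, ev_id, obs_id and mask are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Cereal', 'Coke', 'Rice', 'Beef', 'Pork', 'Chicken'],
    'Date': pd.to_datetime(['2008-04-02', '2008-04-01', '2008-04-03', '2008-04-05',
                            '2008-07-22', '2008-08-15', '2008-08-22', '2008-09-08']),
})
event = pd.DataFrame({'Date': pd.to_datetime(['2008-05-03', '2008-07-20', '2008-09-01'])})

def month_id(dates):
    # Running month count, so December and the following January differ by 1.
    return (dates.dt.year * 12 + dates.dt.month).to_numpy()

ev_id = month_id(event['Date'])
obs_id = month_id(df['Date'])

# Keep an observation if its month is the same as, or adjacent to, any event month.
mask = (np.abs(obs_id[:, None] - ev_id) <= 1).any(axis=1)
result = df[mask]
print(result)
```

With the sample data every row is kept, because April is adjacent to the May event; this month-level filter is much coarser than the 30-day one.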
