
How to create a new boolean column in a dataframe based on multiple conditions from another dataframe in pandas

I have a dataframe

entity  response    date
p   a1  1-Feb-14
p   a2  2-Feb-14
p   a3  3-Feb-14
p   a4  4-Feb-14
p   a5  5-Feb-14
p   a6  6-Feb-14
p   a7  7-Feb-14
p   a8  8-Feb-14
p   a9  9-Feb-14
p   a10 10-Feb-14
p   a11 11-Feb-14
p   a12 12-Feb-14
p   a13 13-Feb-14
p   a14 14-Feb-14
p   a15 15-Feb-14

and another dataframe:

entity  start_date  end_date
p   2-Feb-14    4-Feb-14
p   6-Feb-14    7-Feb-14
p   9-Feb-14    12-Feb-14
q   1-Feb-14    7-Feb-14

Based on the second dataframe, I need to add a True/False column to the first dataframe: for entity p, the value should be True if the date falls within any of that entity's start_date/end_date windows, and False otherwise.

What would be the fastest and shortest way to do this? I tried iterating over the whole dataframe, but that is slow and makes the code long.
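For anyone who wants to reproduce this, here is a minimal sketch of how the two frames might be built. It assumes the dates are parsed into real datetimes with the format `%d-%b-%y` (English month abbreviations); the variable names `df` and `df2` are chosen here for illustration.

```python
import pandas as pd

DATE_FORMAT = "%d-%b-%y"  # assumption: strings such as "1-Feb-14"

# First frame: one response per day for entity "p"
df = pd.DataFrame({
    "entity": ["p"] * 15,
    "response": [f"a{i}" for i in range(1, 16)],
    "date": pd.to_datetime([f"{i}-Feb-14" for i in range(1, 16)], format=DATE_FORMAT),
})

# Second frame: date windows per entity
df2 = pd.DataFrame({
    "entity": ["p", "p", "p", "q"],
    "start_date": pd.to_datetime(
        ["2-Feb-14", "6-Feb-14", "9-Feb-14", "1-Feb-14"], format=DATE_FORMAT),
    "end_date": pd.to_datetime(
        ["4-Feb-14", "7-Feb-14", "12-Feb-14", "7-Feb-14"], format=DATE_FORMAT),
})
```

Parsing the dates up front matters because comparisons on the raw strings would be lexicographic rather than chronological.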

Maybe I'm overthinking, but

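# Assumes df.date, df2.start_date and df2.end_date are already datetime columns.
def f(s):
    # s holds the dates of one entity group; s.name is that entity
    windows = df2[df2.entity == s.name]
    f2 = lambda d: ((d >= windows.start_date) & (d <= windows.end_date)).any()
    return s.transform(f2)

df.groupby('entity').date.transform(f)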

0     False
1      True
2      True
3      True
4     False
5      True
6      True
7     False
8      True
9      True
10     True
11     True
12    False
13    False
14    False
15    False
Name: date, dtype: bool
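For comparison, here is a vectorized sketch of the same check that avoids per-element lambdas. It is not the method from the answer above but a swapped-in technique: merge on entity, test each date against each window with `Series.between`, then reduce with `any`. The small frames and the names `windows`, `pairs` and `in_window` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "entity": ["p"] * 5,
    "date": pd.to_datetime(
        ["2014-02-01", "2014-02-02", "2014-02-04", "2014-02-05", "2014-02-06"]),
})
windows = pd.DataFrame({
    "entity": ["p", "p", "q"],
    "start_date": pd.to_datetime(["2014-02-02", "2014-02-06", "2014-02-01"]),
    "end_date": pd.to_datetime(["2014-02-04", "2014-02-07", "2014-02-07"]),
})

# Pair every row of df with every window of the same entity.
# reset_index() keeps the original row label in a column named "index".
pairs = df.reset_index().merge(windows, on="entity", how="left")

# A pair is a hit when the date lies inside the window (both ends inclusive).
pairs["hit"] = pairs["date"].between(pairs["start_date"], pairs["end_date"])

# A row of df is True if any of its windows produced a hit.
in_window = pairs.groupby("index")["hit"].any().reindex(df.index, fill_value=False)
```

The intermediate frame has one row per (row, window) pair, so this trades memory for speed when an entity has many windows.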

You can also do some preprocessing first to speed up the process

# Assumes the date columns are datetime (pd.Interval does not accept strings).
df2['j'] = df2.apply(lambda k: pd.Interval(k.start_date, k.end_date), axis=1)
dic = df2.groupby('entity')['j'].agg(list).to_dict()
df.apply(lambda x: any(x['date'] in z for z in dic[x['entity']]), axis=1)

Note that pd.Interval is closed only on the right by default, so a date equal to start_date is not matched; pass closed='both' to include it. This should be around 20x faster than the chained transforms.
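To make the effect of the default closedness concrete, here is a small sketch using the first window of entity p. The variable names are illustrative only.

```python
import pandas as pd

start, end = pd.Timestamp("2014-02-02"), pd.Timestamp("2014-02-04")

right_closed = pd.Interval(start, end)                # default: closed="right"
both_closed = pd.Interval(start, end, closed="both")  # includes both endpoints

start_in_default = start in right_closed  # False: the left endpoint is open
start_in_both = start in both_closed      # True
end_in_default = end in right_closed      # True: the right endpoint is closed
```

With the default, 2-Feb-14 would be reported as outside the window, which differs from an inclusive `>=` / `<=` comparison.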

IMHO, depending on your data, it is sometimes acceptable to expand each date range into individual days first and then merge:

df2 = pd.concat([
    pd.DataFrame(pd.date_range(start_date, end_date), columns=['date']).assign(entity=entity)
    for _, (entity, start_date, end_date) in df2.iterrows()
]).drop_duplicates()
df.merge(df2, on=['entity', 'date'], how='left', indicator=True)['_merge'] == 'both'
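As a self-contained sketch of this expand-then-merge idea, the snippet below runs it on a small made-up subset of the data and stores the result as a new column. The names `windows`, `expanded` and `in_window` are illustrative, and the dates are assumed to be parsed datetimes already.

```python
import pandas as pd

df = pd.DataFrame({
    "entity": ["p"] * 5,
    "date": pd.to_datetime(
        ["2014-02-01", "2014-02-02", "2014-02-04", "2014-02-05", "2014-02-06"]),
})
windows = pd.DataFrame({
    "entity": ["p", "p", "q"],
    "start_date": pd.to_datetime(["2014-02-02", "2014-02-06", "2014-02-01"]),
    "end_date": pd.to_datetime(["2014-02-04", "2014-02-07", "2014-02-07"]),
})

# One row per (entity, day) covered by any window; duplicates from
# overlapping windows are dropped so the merge cannot multiply rows.
expanded = pd.concat([
    pd.DataFrame({"entity": row.entity,
                  "date": pd.date_range(row.start_date, row.end_date)})
    for row in windows.itertuples(index=False)
]).drop_duplicates()

# Left-merge keeps every row of df in order; "_merge" says whether it matched.
merged = df.merge(expanded, on=["entity", "date"], how="left", indicator=True)
df["in_window"] = (merged["_merge"] == "both").to_numpy()
```

This only works at the granularity of `pd.date_range` (daily by default), so timestamps with a time-of-day component would need to be normalized first.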

