简体   繁体   中英

find if the value in pandas column is in another column's list

I have a df that looks like this:

    ID       dates                              holidays
0  ABC  01/01/2022  [12/25/2021, 01/01/2022, 01/17/2022]
1  DEF  03/17/2022  [01/01/2022, 06/01/2022, 27/01/2022]

I need to create another column isHoliday with bool value if the value in column dates is in holidays list for that id. I am trying to use the code below but it shows an error: TypeError: unhashable type: 'list' :

df['isHoliday'] = np.where(df['dates'].isin(df['holidays']), True, False)

Is it any way to find whether the columns dates value is in the list that corresponds each ID?

explode works really well here:

df['isHoliday'] = df.eval('dates in holidays.explode()')

Output:

>>> df
    ID       dates                              holidays  isHoliday
0  ABC  01/01/2022  [12/25/2021, 01/01/2022, 01/17/2022]       True
1  DEF  03/17/2022  [01/01/2022, 06/01/2022, 27/01/2022]      False

Benchmarks:

def scott():
    [d in l for d, l in zip(df['dates'], df['holidays'])]

def shubham():
    pd.DataFrame([*df['holidays']]).eq(df['dates'], axis=0).any(1)

def charles():
    df[['dates','holidays']].apply(lambda row: row['dates'] in row['holidays'], axis=1)

def richardec():
    df.eval('dates in holidays.explode()')

###

%timeit scott()
# 5.99 µs ± 302 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

%timeit shubham()
# 547 µs ± 10.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit charles()
# 525 µs ± 2.27 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit richardec()
# 971 µs ± 71.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

So mine is the shortest, but Scott's is the fastest!

Try this:

df['isholiday'] = [d in l for d, l in zip(df['dates'], df['holidays'])]

Output:

    ID       dates                              holidays  isholiday
0  ABC  01/01/2022  [12/25/2021, 01/01/2022, 01/17/2022]       True
1  DEF  03/17/2022  [01/01/2022, 06/01/2022, 01/27/2022]      False

Here is a vectorized approach:

df['isholiday'] = pd.DataFrame([*df['holidays']]).eq(df['dates'], axis=0).any(1)

Result

    ID       dates                                holidays  isholiday
0  ABC  01/01/2022    [12/25/2021, 01/01/2022, 01/17/2022]       True
1  DEF  03/17/2022  [01/01/2022,  06/01/2022,  27/01/2022]      False

The source of the issue is that the .isin() method does not vectorize its argument, so it's checking if each date is in the whole series of lists, and in the process I believe it casts the argument to a set for speed purposes. Sets only accept immutable values, and lists are mutable, so it gives the unhashable type error.

I think you'll want to use an apply instead:

df['isHoliday'] = df[['dates','holidays']].apply(
    lambda row: row['dates'] in row['holidays'], 
    axis=1
)

In this case you don't need the np.where because the condition itself returns a series of True s and False s.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM