简体   繁体   中英

Pandas DataFrame check if column value exists in a group of columns

I have a DataFrame like this (simplified example)

id  v0  v1  v2  v3  v4
1   10  5   10  22  50
2   22  23  55  60  50
3   8   2   40  80  110
4   15  15  25  100 101

And would like to create an additional column that is either 1 or 0. 1 if v0 value is in the values of v1 to v4, and 0 if it's not. So, in this example for id 1 then the value should be 1 (since v2 = 10) and for id 2 value should be 0 since 22 is not in v1 thru v4.

In reality the table is way bigger (around 100,000 rows and variables go from v1 to v99).

You can use the underlying numpy arrays for performance:

Setup

a = df.v0.values
b = df.iloc[:, 2:].values

df.assign(out=(a[:, None]==b).any(1).astype(int))

   id  v0  v1  v2   v3   v4  out
0   1  10   5  10   22   50    1
1   2  22  23  55   60   50    0
2   3   8   2  40   80  110    0
3   4  15  15  25  100  101    1

This solution leverages broadcasting to allow for pairwise comparison:

First, we broadcast a :

>>> a[:, None]
array([[10],
       [22],
       [ 8],
       [15]], dtype=int64)

Which allows for pairwise comparison with b :

>>> a[:, None] == b
array([[False,  True, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [ True, False, False, False]])

We then simply check for any True results along the first axis, and convert to integer.


Performance


Functions

def user_chris(df):
    a = df.v0.values
    b = df.iloc[:, 2:].values
    return (a[:, None]==b).any(1).astype(int)

def rahlf23(df):
    df = df.set_index('id')
    return df.drop('v0', 1).isin(df['v0']).any(1).astype(int)

def chris_a(df):
    return df.loc[:, "v1":].eq(df['v0'], 0).any(1).astype(int)

def chris(df):
    return df.apply(lambda x: int(x['v0'] in x.values[2:]), axis=1)

def anton_vbr(df):
    df.set_index('id', inplace=True)
    return df.isin(df.pop('v0')).any(1).astype(int)

Setup

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from timeit import timeit

res = pd.DataFrame(
       index=['user_chris', 'rahlf23', 'chris_a', 'chris', 'anton_vbr'],
       columns=[10, 50, 100, 500, 1000, 5000],
       dtype=float
)

for f in res.index:
    for c in res.columns:
        vals = np.random.randint(1, 100, (c, c))
        vals = np.column_stack((np.arange(vals.shape[0]), vals))
        df = pd.DataFrame(vals, columns=['id'] + [f'v{i}' for i in range(0, vals.shape[0])])
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N");
ax.set_ylabel("time (relative)");

plt.show()

Output

在此输入图像描述

How about:

df['new_col'] = df.loc[:, "v1":].eq(df['v0'],0).any(1).astype(int)

[out]

   id  v0  v1  v2   v3   v4  new_col
0   1  10   5  10   22   50        1
1   2  22  23  55   60   50        0
2   3   8   2  40   80  110        0
3   4  15  15  25  100  101        1

I'm assuming here that id is set to be your dataframe index here:

df = df.set_index('id')

Then the following should work (similar answer here ):

df['New'] = df.drop('v0', 1).isin(df['v0']).any(1).astype(int)

Gives:

    v0  v1  v2   v3   v4  New
id                           
1   10   5  10   22   50    1
2   22  23  55   60   50    0
3    8   2  40   80  110    0
4   15  15  25  100  101    1

You can also use a lambda function:

df['newCol'] = df.apply(lambda x: int(x['v0'] in x.values[2:]), axis=1)

    id  v0  v1  v2  v3  v4  newCol
0   1   10  5   10  22  50  1
1   2   22  23  55  60  50  0
2   3   8   2   40  80  110 0
3   4   15  15  25  100 101 1

Another take, most likely the smallest syntax:

df['new'] = df.isin(df.pop('v0')).any(1).astype(int)

Full proof:

import pandas as pd

data = '''\
id  v0  v1  v2  v3  v4
1   10  5   10  22  50
2   22  23  55  60  50
3   8   2   40  80  110
4   15  15  25  100 101'''

df = pd.read_csv(pd.compat.StringIO(data), sep='\s+')
df.set_index('id', inplace=True)
df['new'] = df.isin(df.pop('v0')).any(1).astype(int)
print(df)

Returns:

    v1  v2   v3   v4  new
id                       
1    5  10   22   50    1
2   23  55   60   50    0
3    2  40   80  110    0
4   15  25  100  101    1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM