Pandas DataFrame check if column value exists in a group of columns
I have a DataFrame like this (simplified example):
id v0 v1 v2 v3 v4
1 10 5 10 22 50
2 22 23 55 60 50
3 8 2 40 80 110
4 15 15 25 100 101
I would like to create an additional column that is either 1 or 0: 1 if the v0 value appears among the values of v1 to v4 in the same row, and 0 if it does not. So, in this example, for id 1 the value should be 1 (since v2 = 10), and for id 2 it should be 0, since 22 is not in v1 through v4.
In reality the table is much bigger (around 100,000 rows, and the variables go from v1 to v99).
You can use the underlying numpy arrays for performance:
Setup
a = df.v0.values
b = df.iloc[:, 2:].values
df.assign(out=(a[:, None]==b).any(1).astype(int))
id v0 v1 v2 v3 v4 out
0 1 10 5 10 22 50 1
1 2 22 23 55 60 50 0
2 3 8 2 40 80 110 0
3 4 15 15 25 100 101 1
This solution leverages broadcasting to allow for pairwise comparison.
First, we broadcast a:
>>> a[:, None]
array([[10],
[22],
[ 8],
[15]], dtype=int64)
This allows for pairwise comparison with b:
>>> a[:, None] == b
array([[False, True, False, False],
[False, False, False, False],
[False, False, False, False],
[ True, False, False, False]])
We then simply check for any True result along axis 1 (that is, within each row) and convert to integer.
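The steps above can be put together in a self-contained sketch. Selecting the columns by name through the `value_cols` list is an assumption made for illustration; the answer itself slices by position with `iloc`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "v0": [10, 22, 8, 15],
    "v1": [5, 23, 2, 15],
    "v2": [10, 55, 40, 25],
    "v3": [22, 60, 80, 100],
    "v4": [50, 50, 110, 101],
})

# Hypothetical helper list: every column to search, picked by name.
value_cols = ["v1", "v2", "v3", "v4"]

a = df["v0"].to_numpy()        # shape (4,)
b = df[value_cols].to_numpy()  # shape (4, 4)

# a[:, None] has shape (4, 1); comparing it with b broadcasts it across
# the columns of b, giving one boolean per (row, column) pair.
mask = a[:, None] == b         # shape (4, 4)

# Reduce along axis 1 so each row collapses to a single flag.
out = mask.any(axis=1).astype(int)
print(out.tolist())  # [1, 0, 0, 1]
```

Note that the intermediate `mask` holds one boolean per cell of `b`, so its size grows with rows times columns.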
Performance
Functions
def user_chris(df):
    a = df.v0.values
    b = df.iloc[:, 2:].values
    return (a[:, None] == b).any(axis=1).astype(int)

def rahlf23(df):
    df = df.set_index('id')
    # pandas 2.0 and later expect the axis as a keyword in drop() and any()
    return df.drop(columns='v0').isin(df['v0']).any(axis=1).astype(int)

def chris_a(df):
    return df.loc[:, "v1":].eq(df['v0'], axis=0).any(axis=1).astype(int)

def chris(df):
    return df.apply(lambda x: int(x['v0'] in x.values[2:]), axis=1)

def anton_vbr(df):
    # set_index returns a new frame, so pop() leaves the caller's df intact
    # and the function can be called repeatedly by timeit
    df = df.set_index('id')
    return df.isin(df.pop('v0')).any(axis=1).astype(int)
Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from timeit import timeit
res = pd.DataFrame(
    index=['user_chris', 'rahlf23', 'chris_a', 'chris', 'anton_vbr'],
    columns=[10, 50, 100, 500, 1000, 5000],
    dtype=float
)
for f in res.index:
    for c in res.columns:
        vals = np.random.randint(1, 100, (c, c))
        vals = np.column_stack((np.arange(vals.shape[0]), vals))
        df = pd.DataFrame(vals, columns=['id'] + [f'v{i}' for i in range(0, vals.shape[0])])
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
plt.show()
Output (log-log plot of relative time against N; the image is not reproduced here)
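Before comparing timings, it can help to confirm that the approaches return the same flags on random data. The following is a minimal sketch of such a check; the function names `via_numpy`, `via_eq` and `via_apply`, the seed and the size `n` are chosen here for illustration and are not part of the benchmark above.

```python
import numpy as np
import pandas as pd

def via_numpy(df):
    # Broadcasting comparison on the underlying arrays.
    a = df["v0"].to_numpy()
    b = df.iloc[:, 2:].to_numpy()
    return (a[:, None] == b).any(axis=1).astype(int)

def via_eq(df):
    # Row-aligned DataFrame.eq followed by a row-wise any().
    return df.loc[:, "v1":].eq(df["v0"], axis=0).any(axis=1).astype(int)

def via_apply(df):
    # Plain Python membership test, one row at a time.
    return df.apply(lambda x: int(x["v0"] in x.values[2:]), axis=1)

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n = 50                          # small size so the check runs quickly
vals = rng.integers(1, 100, size=(n, n))
vals = np.column_stack((np.arange(n), vals))
df = pd.DataFrame(vals, columns=["id"] + [f"v{i}" for i in range(n)])

expected = via_numpy(df)
agree = all(
    np.array_equal(expected, np.asarray(f(df)))
    for f in (via_eq, via_apply)
)
print(agree)
```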
How about:
df['new_col'] = df.loc[:, "v1":].eq(df['v0'], axis=0).any(axis=1).astype(int)
[out]
id v0 v1 v2 v3 v4 new_col
0 1 10 5 10 22 50 1
1 2 22 23 55 60 50 0
2 3 8 2 40 80 110 0
3 4 15 15 25 100 101 1
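The slice `"v1":` depends on the v columns being the last columns of the frame. If that is not guaranteed, one option is to pick them by name instead. This is a sketch under the assumption that the columns follow the v1 to v99 naming from the question; the extra `note` column is hypothetical and only there to show that it is skipped.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "v0": [10, 22, 8, 15],
    "v1": [5, 23, 2, 15],
    "v2": [10, 55, 40, 25],
    "v3": [22, 60, 80, 100],
    "v4": [50, 50, 110, 101],
    "note": ["a", "b", "c", "d"],  # hypothetical extra column to skip
})

# Assumed naming scheme: "v" followed by a number that does not start
# with 0, so v1..v99 match and v0 does not.
candidates = df.filter(regex=r"^v[1-9]\d*$")

# axis=0 aligns df["v0"] with the row index, so each row is compared
# against its own v0 value.
df["new_col"] = candidates.eq(df["v0"], axis=0).any(axis=1).astype(int)
print(df["new_col"].tolist())  # [1, 0, 0, 1]
```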
I'm assuming here that id is set to be your DataFrame index:
df = df.set_index('id')
Then the following should work (similar answer here):
df['New'] = df.drop(columns='v0').isin(df['v0']).any(axis=1).astype(int)
Gives:
v0 v1 v2 v3 v4 New
id
1 10 5 10 22 50 1
2 22 23 55 60 50 0
3 8 2 40 80 110 0
4 15 15 25 100 101 1
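This works row by row because `DataFrame.isin` treats a Series argument specially: each cell is compared with the Series value that shares its index label, not with the whole set of values. A small sketch with made-up numbers shows the difference from passing a plain list:

```python
import pandas as pd

# Toy data: each v1 value occurs in v0, but never in the same row.
df = pd.DataFrame({"v0": [1, 2], "v1": [2, 1]}, index=[10, 20])

# Passing a Series: row 10 is checked against v0 at index 10 only.
row_wise = df[["v1"]].isin(df["v0"])

# Passing a plain list: every cell is checked against the whole set {1, 2}.
set_wise = df[["v1"]].isin(df["v0"].tolist())

print(row_wise["v1"].tolist())  # [False, False]
print(set_wise["v1"].tolist())  # [True, True]
```

The Series form relies on the index labels, which is why setting id as a unique index first matters here.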
You can also use a lambda function:
df['newCol'] = df.apply(lambda x: int(x['v0'] in x.values[2:]), axis=1)
id v0 v1 v2 v3 v4 newCol
0 1 10 5 10 22 50 1
1 2 22 23 55 60 50 0
2 3 8 2 40 80 110 0
3 4 15 15 25 100 101 1
Another take, most likely the shortest syntax (note that pop removes v0 from df):
df['new'] = df.isin(df.pop('v0')).any(axis=1).astype(int)
Full example:
import io
import pandas as pd
data = '''\
id v0 v1 v2 v3 v4
1 10 5 10 22 50
2 22 23 55 60 50
3 8 2 40 80 110
4 15 15 25 100 101'''
# io.StringIO replaces pd.compat.StringIO, which newer pandas no longer provides
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
df.set_index('id', inplace=True)
df['new'] = df.isin(df.pop('v0')).any(axis=1).astype(int)
print(df)
Returns:
v1 v2 v3 v4 new
id
1 5 10 22 50 1
2 23 55 60 50 0
3 2 40 80 110 0
4 15 25 100 101 1