Pandas DataFrame - How to check whether the string value in column A is present in the list of strings in column B

Question

This is my DataFrame. It has two columns: column A contains strings, and column B contains lists of strings.

import pandas as pd

df = pd.DataFrame(columns=['A','B'])
df.loc[0] = ['apple',['orange','banana','blueberry']]
df.loc[1] = ['orange',['orange','banana','avocado']]
df.loc[2] = ['blueberry',['apple','banana','blueberry']]
df.loc[3] = ['cherry',['apple','orange','banana']]

print(df)

           A                            B
0      apple  [orange, banana, blueberry]
1     orange    [orange, banana, avocado]
2  blueberry   [apple, banana, blueberry]
3     cherry      [apple, orange, banana]

I want to check, row by row, whether the value in column A appears in the list in column B of the same row. The expected output is:

0 False
1 True
2 True
3 False

I tried isin, which works for checking against a static list:

df.A.isin(['orange','banana','blueberry'])
0    False
1     True
2    False
3    False

However, when I try to use it to check against the lists stored in the DataFrame, it fails:

df.A.isin(df.B)
TypeError: unhashable type: 'list'

If pandas offers a solution for this, I would like to avoid for loops and lambdas.

Any help is greatly appreciated.

Answer 1

Fun with `sets`

df.A.apply(lambda x: set([x])) <= df.B.apply(set)

0    False
1     True
2     True
3    False
dtype: bool
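
A short sketch of why this works: `<=` between two Series is applied element by element, and for Python sets `<=` is the subset test, so a one-element set `{x}` is a subset of `s` exactly when `x in s`. The DataFrame is rebuilt from the question's data so the snippet runs on its own; the names `left`, `right` and `mask` are only illustrative.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'A': ['apple', 'orange', 'blueberry', 'cherry'],
    'B': [['orange', 'banana', 'blueberry'],
          ['orange', 'banana', 'avocado'],
          ['apple', 'banana', 'blueberry'],
          ['apple', 'orange', 'banana']],
})

# Plain Python: {x} <= s is the subset test, equivalent to "x in s".
print({'orange'} <= {'orange', 'banana'})   # True
print({'cherry'} <= {'orange', 'banana'})   # False

left = df.A.apply(lambda x: {x})   # one-element set per row
right = df.B.apply(set)            # list converted to a set per row
mask = left <= right               # element-wise subset test
print(mask.tolist())               # [False, True, True, False]
```

Note that this still calls a Python function per row through apply, so it hides the loop rather than removing it.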

No loops

But I'd still use @jezrael's comprehension.

pd.DataFrame(df.B.tolist(), df.index).eq(df.A, axis=0).any(axis=1)

0    False
1     True
2     True
3    False
dtype: bool
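
To make the one-liner easier to follow, here is the same idea split into steps, as a sketch on the question's data. The intermediate names `wide` and `hits` are mine, not part of the original answer.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'A': ['apple', 'orange', 'blueberry', 'cherry'],
    'B': [['orange', 'banana', 'blueberry'],
          ['orange', 'banana', 'avocado'],
          ['apple', 'banana', 'blueberry'],
          ['apple', 'orange', 'banana']],
})

# Step 1: spread each list over its own columns (0, 1, 2), keeping df's index.
wide = pd.DataFrame(df.B.tolist(), index=df.index)

# Step 2: compare every column with A, aligning on the row index (axis=0).
hits = wide.eq(df.A, axis=0)

# Step 3: a row matches if any of its cells matched.
mask = hits.any(axis=1)
print(mask.tolist())   # [False, True, True, False]
```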

NumPy broadcasting

This only works when every list in B has the same length.

import numpy as np

pd.Series(
    np.char.equal(df.A.values.astype(str), np.array(df.B.tolist()).T).any(0),
    df.index
)

0    False
1     True
2     True
3    False
dtype: bool
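
The equal-length restriction comes from needing a rectangular 2-D array to broadcast against. The sketch below shows the shapes involved; it uses the plain `==` operator on string arrays instead of the `equal` function above, and it transposes nothing, so treat it as an illustration of the broadcasting idea rather than the exact code from the answer.

```python
import numpy as np

a = np.array(['apple', 'orange', 'blueberry', 'cherry'])   # shape (4,)
b = np.array([['orange', 'banana', 'blueberry'],
              ['orange', 'banana', 'avocado'],
              ['apple', 'banana', 'blueberry'],
              ['apple', 'orange', 'banana']])               # shape (4, 3)

# a[:, None] has shape (4, 1); it broadcasts against (4, 3),
# so each value of a is compared with every item of its own row of b.
hits = a[:, None] == b
mask = hits.any(axis=1)
print(mask.tolist())   # [False, True, True, False]
```

If the lists in B had different lengths, np.array could not build the rectangular `b` in the first place, which is why this approach does not apply there.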

`pd.get_dummies`

df.B.str.join('|').str.get_dummies().mul(pd.get_dummies(df.A)).any(axis=1)

0    False
1     True
2     True
3    False
dtype: bool

`np.bincount`

I like this one (-:
However, jezrael noted that it performs poorly )-: so be careful.

i = np.arange(len(df)).repeat(df.B.str.len())
pd.Series(
    np.bincount(i, df.A.values[i] == np.concatenate(df.B)).astype(bool),
    df.index
)

0    False
1     True
2     True
3    False
dtype: bool
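
Since this one is dense, here is a step-by-step sketch of the same technique on plain NumPy data. The variable names are illustrative, and the `minlength` argument is my addition: without it, the result would come out shorter than the number of rows if the last rows had empty lists.

```python
import numpy as np

A = np.array(['apple', 'orange', 'blueberry', 'cherry'])
B = [['orange', 'banana', 'blueberry'],
     ['orange', 'banana', 'avocado'],
     ['apple', 'banana', 'blueberry'],
     ['apple', 'orange', 'banana']]

lengths = [len(lst) for lst in B]          # items per row: [3, 3, 3, 3]
i = np.arange(len(B)).repeat(lengths)      # row number of every list item
flat = np.concatenate(B)                   # all list items in one 1-D array
matches = A[i] == flat                     # True where an item equals its row's A

# Sum the matches per row; bincount treats the booleans as weights 0.0 / 1.0.
counts = np.bincount(i, weights=matches, minlength=len(B))
mask = counts.astype(bool)
print(mask.tolist())                       # [False, True, True, False]
```

Because the lists are flattened rather than stacked, this variant does not need the lists to have equal lengths.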

Answer 2

The fastest option is a pure list comprehension that checks membership with in:

m = pd.Series([i in j for i, j in zip(df.A, df.B)], index=df.index)
print (m)
0    False
1     True
2     True
3    False
dtype: bool

A solution with apply:

m = df.apply(lambda x: x.A in x.B, axis=1)
print (m)
0    False
1     True
2     True
3    False
dtype: bool
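
As a usage sketch, the resulting boolean Series can be stored as a column or used directly as a filter. The column name `A_in_B` is a made-up example, and the DataFrame is rebuilt from the question's data so the snippet is self-contained.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'A': ['apple', 'orange', 'blueberry', 'cherry'],
    'B': [['orange', 'banana', 'blueberry'],
          ['orange', 'banana', 'avocado'],
          ['apple', 'banana', 'blueberry'],
          ['apple', 'orange', 'banana']],
})

# Row-wise membership test via a list comprehension.
m = pd.Series([a in b for a, b in zip(df.A, df.B)], index=df.index)

df['A_in_B'] = m       # keep the result next to the data (hypothetical column name)
matching = df[m]       # keep only rows where A occurs in B
print(matching.A.tolist())   # ['orange', 'blueberry']
```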

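Thanks to @pir for the timing setup with a graph:

import numpy as np
from timeit import timeit

equal = np.char.equal  # public path to the function formerly imported from numpy.core.defchararray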

def jez1(x):
    return pd.Series([i in j for i, j in zip(x.A, x.B)], index=x.index)

def jez2(x):
    return x.apply(lambda x: x.A in x.B, axis=1)

def pir1(x):
    return x.A.apply(lambda x: set([x])) <= x.B.apply(set)

def pir2(x):
    return pd.DataFrame(x.B.tolist(), x.index).eq(x.A, axis=0).any(axis=1)

def pir3(x):
    return x.B.str.join('|').str.get_dummies().mul(pd.get_dummies(x.A)).any(axis=1)

def pir4(x):
    return pd.Series(equal(x.A.values.astype(str), np.array(x.B.tolist()).T).any(0),x.index)

def pir5(x):   
    i = np.arange(len(x)).repeat(x.B.str.len())
    return pd.Series(np.bincount(i, x.A.values[i] == np.concatenate(x.B)).astype(bool),x.index)

res = pd.DataFrame(
    index=[10, 100, 500, 1000],
    columns='jez1 jez2 pir1 pir2 pir3 pir4 pir5'.split(),
    dtype=float
)

for i in res.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)

print (res.div(res.min(1), 0))
      jez1        jez2      pir1      pir2       pir3      pir4        pir5
10     1.0   13.235732  4.984622  5.687160  38.796462  1.132400    7.283616
100    1.0   79.879019  6.515313  5.159239  82.787444  1.963980   65.205917
500    1.0  162.672370  6.255446  2.761716  51.753635  3.506066   88.300689
1000   1.0  196.374333  8.813674  2.908213  63.753664  4.797193  125.889481

res.plot(loglog=True)

Answer 1: 4, accepted, 2018-03-23 06:22:40
Answer 2: 3, 2018-03-23 06:21:13
