Pandas DataFrame - How to check whether a string value in column A is in the list of strings in column B

Question

This is my DataFrame. It has two columns: column A contains strings, and column B contains lists of strings.

import pandas as pd

df = pd.DataFrame(columns=['A','B'])
df.loc[0] = ['apple',['orange','banana','blueberry']]
df.loc[1] = ['orange',['orange','banana','avocado']]
df.loc[2] = ['blueberry',['apple','banana','blueberry']]
df.loc[3] = ['cherry',['apple','orange','banana']]

print(df)

           A                            B
0      apple  [orange, banana, blueberry]
1     orange    [orange, banana, avocado]
2  blueberry   [apple, banana, blueberry]
3     cherry      [apple, orange, banana]

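I want to check each row to see whether the value in column A appears in the list in column B of the same row. So the expected output should be: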

0 False
1 True
2 True
3 False

I tried isin, which can check against a static list:

df.A.isin(['orange','banana','blueberry'])
0    False
1     True
2    False
3    False

However, when I try to use it to check against the list items in the DataFrame, it does not work:

df.A.isin(df.B)
TypeError: unhashable type: 'list'

If there is a solution available with Pandas, I would like to avoid for loops and lambda.

Any help is greatly appreciated.

Answer 1

Fun with `sets`

df.A.apply(lambda x: set([x])) <= df.B.apply(set)

0    False
1     True
2     True
3    False
dtype: bool
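
The comparison above relies on `<=` between two sets meaning "is a subset of". As a rough sketch of what happens on each row (written in plain Python rather than through the Series comparison, and rebuilding the question's `df` so it runs on its own):

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'A': ['apple', 'orange', 'blueberry', 'cherry'],
    'B': [['orange', 'banana', 'blueberry'],
          ['orange', 'banana', 'avocado'],
          ['apple', 'banana', 'blueberry'],
          ['apple', 'orange', 'banana']],
})

# {a} is a one-element set; {a} <= set(b) is True exactly when a is in b.
per_row = [{a} <= set(b) for a, b in zip(df.A, df.B)]
result = pd.Series(per_row, index=df.index)
print(result)
```

This only illustrates the subset test; the one-liner above applies the same test through the Series `<=` operator.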

No loops

But I would still use @jezrael's comprehension

pd.DataFrame(df.B.tolist(), df.index).eq(df.A, axis=0).any(axis=1)

0    False
1     True
2     True
3    False
dtype: bool

NumPy broadcasting

Only works when every list in B has the same length.

import numpy as np
from numpy.char import equal  # public alias of numpy.core.defchararray

pd.Series(
    equal(df.A.values.astype(str), np.array(df.B.tolist()).T).any(0),
    df.index
)

0    False
1     True
2     True
3    False
dtype: bool
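
To make the shapes in the broadcast explicit, here is a minimal sketch of the same idea. It swaps `equal` for a plain `==` between string arrays and compares along rows instead of transposing, so it is an illustration rather than the exact expression above; the names `a`, `b`, and `hits` are only for this example.

```python
import numpy as np
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'A': ['apple', 'orange', 'blueberry', 'cherry'],
    'B': [['orange', 'banana', 'blueberry'],
          ['orange', 'banana', 'avocado'],
          ['apple', 'banana', 'blueberry'],
          ['apple', 'orange', 'banana']],
})

a = df.A.to_numpy().astype(str)   # shape (4,)
b = np.array(df.B.tolist())       # shape (4, 3); requires equal-length lists

# a[:, None] has shape (4, 1), so it broadcasts across the 3 columns of b.
hits = b == a[:, None]            # boolean array of shape (4, 3)
result = pd.Series(hits.any(axis=1), index=df.index)
print(result)
```

The equal-length requirement comes from `np.array(df.B.tolist())`: lists of different lengths cannot form a rectangular 2-D array.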

`pd.get_dummies`

df.B.str.join('|').str.get_dummies().mul(pd.get_dummies(df.A)).any(axis=1)

0    False
1     True
2     True
3    False
dtype: bool

`np.bincount`

I like this one (-:
However, jezrael noted its poor performance )-: so be careful.

i = np.arange(len(df)).repeat(df.B.str.len())
pd.Series(
    np.bincount(i, df.A.values[i] == np.concatenate(df.B)).astype(bool),
    df.index
)

0    False
1     True
2     True
3    False
dtype: bool
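
The `np.bincount` version is compact, so here is a step-by-step sketch of the same technique with the intermediate arrays named. The variable names and the `minlength` argument are additions for illustration, and `df.B.map(len)` stands in for `df.B.str.len()`.

```python
import numpy as np
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'A': ['apple', 'orange', 'blueberry', 'cherry'],
    'B': [['orange', 'banana', 'blueberry'],
          ['orange', 'banana', 'avocado'],
          ['apple', 'banana', 'blueberry'],
          ['apple', 'orange', 'banana']],
})

lengths = df.B.map(len).to_numpy()              # number of items in each list
row_ids = np.arange(len(df)).repeat(lengths)    # row label for every flattened item
flat = np.concatenate(df.B.tolist())            # all list items in one 1-D array

# For each flattened item, does it equal the A value of its own row?
matches = df.A.to_numpy().astype(str)[row_ids] == flat

# Sum the matches per row; a non-zero count means A was found in B.
counts = np.bincount(row_ids, weights=matches, minlength=len(df))
result = pd.Series(counts.astype(bool), index=df.index)
print(result)
```

Unlike the broadcasting approach, this does not need the lists in B to have the same length, since everything is flattened first.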

Answer 2

The fastest is a pure list comprehension with an `in` check:

m = pd.Series([i in j for i, j in zip(df.A, df.B)], index=df.index)
print (m)
0    False
1     True
2     True
3    False
dtype: bool

A solution with apply:

m = df.apply(lambda x: x.A in x.B, axis=1)
print (m)
0    False
1     True
2     True
3    False
dtype: bool

Thanks to @pir for the timing solution with a graph:

import numpy as np
from timeit import timeit
from numpy.char import equal  # public alias of numpy.core.defchararray

def jez1(x):
    return pd.Series([i in j for i, j in zip(x.A, x.B)], index=x.index)

def jez2(x):
    return x.apply(lambda x: x.A in x.B, axis=1)

def pir1(x):
    return x.A.apply(lambda x: set([x])) <= x.B.apply(set)
def pir2(x):
    return pd.DataFrame(x.B.tolist(), x.index).eq(x.A, axis=0).any(axis=1)
def pir3(x):
    return x.B.str.join('|').str.get_dummies().mul(pd.get_dummies(x.A)).any(axis=1)

def pir4(x):
    return pd.Series(equal(x.A.values.astype(str), np.array(x.B.tolist()).T).any(0),x.index)

def pir5(x):   
    i = np.arange(len(x)).repeat(x.B.str.len())
    return pd.Series(np.bincount(i, x.A.values[i] == np.concatenate(x.B)).astype(bool),x.index)

res = pd.DataFrame(
    index=[10, 100, 500, 1000],
    columns='jez1 jez2 pir1 pir2 pir3 pir4 pir5'.split(),
    dtype=float
)

for i in res.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)

print(res.div(res.min(axis=1), axis=0))
      jez1        jez2      pir1      pir2       pir3      pir4        pir5
10     1.0   13.235732  4.984622  5.687160  38.796462  1.132400    7.283616
100    1.0   79.879019  6.515313  5.159239  82.787444  1.963980   65.205917
500    1.0  162.672370  6.255446  2.761716  51.753635  3.506066   88.300689
1000   1.0  196.374333  8.813674  2.908213  63.753664  4.797193  125.889481

res.plot(loglog=True)
