Python function partial string match

Question

I have a pandas DataFrame like this:

a      b      c
foo    bar    baz
bar    foo    baz
foobar barfoo baz

I've defined the following function in Python:

def somefunction (row):
    if row['a'] == 'foo' and row['b'] == 'bar':
        return 'yes'
    return 'no'

It works perfectly fine, but I need to make a small tweak to the if condition so that it takes partial string matches into account.

I've tried several combinations, but I can't seem to get it to work. I get the following error:

("'str' object has no attribute 'str'", 'occurred at index 0')

The function I've tried is:

def somenewfunction (row):
    if row['a'].str.contains('foo')==True and row['b'] == 'bar':
        return 'yes'
    return 'no'

Answer 1

Use str.contains to build a boolean mask, then pass it to numpy.where:

m = df['a'].str.contains('foo') & (df['b'] == 'bar')
print (m)
0     True
1    False
2    False
dtype: bool

df['new'] = np.where(m, 'yes', 'no')
print (df)
        a       b    c  new
0     foo     bar  baz  yes
1     bar     foo  baz   no
2  foobar  barfoo  baz   no

Or, if you also need to check column b for substrings:

m = df['a'].str.contains('foo') & df['b'].str.contains('bar')
df['new'] = np.where(m, 'yes', 'no')
print (df)
        a       b    c  new
0     foo     bar  baz  yes
1     bar     foo  baz   no
2  foobar  barfoo  baz  yes
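For anyone who wants to reproduce both variants from scratch, here is a minimal self-contained sketch. The frame is rebuilt from the question's sample data; the column names new_exact_b and new_partial_b are made up for illustration, and regex=False is my addition so that 'foo' and 'bar' are treated as literal substrings rather than regex patterns.

```python
import numpy as np
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame({
    'a': ['foo', 'bar', 'foobar'],
    'b': ['bar', 'foo', 'barfoo'],
    'c': ['baz', 'baz', 'baz'],
})

# regex=False (an addition, not in the answer) matches a literal substring.
partial_a = df['a'].str.contains('foo', regex=False)
partial_b = df['b'].str.contains('bar', regex=False)
exact_b = df['b'] == 'bar'

# Hypothetical column names, one per variant shown above.
df['new_exact_b'] = np.where(partial_a & exact_b, 'yes', 'no')
df['new_partial_b'] = np.where(partial_a & partial_b, 'yes', 'no')
print(df)
```

The parentheses around a comparison such as (df['b'] == 'bar') matter when it is combined inline, because & binds more tightly than ==.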

If you need a custom function, you can use one with apply, although this should be slower on a larger DataFrame:

def somefunction (row):
    if 'foo' in row['a'] and row['b'] == 'bar':
        return 'yes'
    return 'no'

print (df.apply(somefunction, axis=1))
0    yes
1     no
2     no
dtype: object

def somefunction (row):
    if 'foo' in row['a']  and  'bar' in row['b']:
        return 'yes'
    return 'no'

print (df.apply(somefunction, axis=1))
0    yes
1     no
2    yes
dtype: object
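One thing the sample data does not exercise is missing values. The sketch below assumes column a may contain None, which is not the case in the question; df_missing and safe_match are hypothetical names. The in operator fails on a non-string cell, so the row-wise version checks the type first, while the vectorized version can pass na=False to str.contains.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing value in column 'a'.
df_missing = pd.DataFrame({
    'a': ['foo', None, 'foobar'],
    'b': ['bar', 'bar', 'bar'],
})

def safe_match(row):
    a = row['a']
    # Guard against non-string cells before using the `in` operator.
    if isinstance(a, str) and 'foo' in a and row['b'] == 'bar':
        return 'yes'
    return 'no'

row_wise = df_missing.apply(safe_match, axis=1)

# na=False makes missing cells count as "no match" in the mask.
mask = df_missing['a'].str.contains('foo', na=False) & (df_missing['b'] == 'bar')
vectorized = np.where(mask, 'yes', 'no')

print(row_wise.tolist())
print(vectorized.tolist())
```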

Timings:

df = pd.concat([df]*1000).reset_index(drop=True)

def somefunction (row):
    if 'foo' in row['a'] and row['b'] == 'bar':
        return 'yes'
    return 'no'

In [269]: %timeit df['new'] = df.apply(somefunction, axis=1)
10 loops, best of 3: 60.7 ms per loop

In [270]: %timeit df['new1'] = np.where(df['a'].str.contains('foo') & (df['b'] == 'bar'), 'yes', 'no')
100 loops, best of 3: 3.25 ms per loop

df = pd.concat([df]*10000).reset_index(drop=True)

def somefunction (row):
    if 'foo' in row['a'] and row['b'] == 'bar':
        return 'yes'
    return 'no'

In [272]: %timeit df['new'] = df.apply(somefunction, axis=1)
1 loop, best of 3: 614 ms per loop

In [273]: %timeit df['new1'] = np.where(df['a'].str.contains('foo') & (df['b'] == 'bar'), 'yes', 'no')
10 loops, best of 3: 23.5 ms per loop
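The %timeit lines above need IPython. Outside of it, a rough equivalent can be sketched with the standard timeit module. The helper names with_apply and with_where and the repeat count of 3 are arbitrary choices of mine, and the numbers will differ by machine and pandas version, so none are shown here.

```python
import timeit

import numpy as np
import pandas as pd

small = pd.DataFrame({
    'a': ['foo', 'bar', 'foobar'],
    'b': ['bar', 'foo', 'barfoo'],
    'c': ['baz', 'baz', 'baz'],
})
# 3 rows repeated 1000 times, as in the first timing run.
big = pd.concat([small] * 1000).reset_index(drop=True)

def somefunction(row):
    if 'foo' in row['a'] and row['b'] == 'bar':
        return 'yes'
    return 'no'

def with_apply():
    # Row-wise: calls somefunction once per row.
    return big.apply(somefunction, axis=1)

def with_where():
    # Vectorized: one mask over whole columns.
    mask = big['a'].str.contains('foo') & (big['b'] == 'bar')
    return np.where(mask, 'yes', 'no')

apply_seconds = timeit.timeit(with_apply, number=3)
where_seconds = timeit.timeit(with_where, number=3)
print(f'apply:    {apply_seconds / 3:.4f} s per call')
print(f'np.where: {where_seconds / 3:.4f} s per call')
```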

Answer 2

Your exception is probably caused by the fact that you wrote:

if row['a'].str.contains('foo')==True

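Remove '.str'. Inside the function, row['a'] is a plain Python string, which has neither a .str accessor nor a contains method, so use the in operator instead:

if 'foo' in row['a']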
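To see where the error comes from, it may help to compare a whole column with a single cell. This is only an illustrative sketch on the question's data; the variable names are mine. The .str accessor exists on a Series, but a single cell is an ordinary Python str, which has no .str attribute and no contains method, so substring tests on it go through the in operator.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['foo', 'bar', 'foobar'],
    'b': ['bar', 'foo', 'barfoo'],
})

column = df['a']         # a Series: has the .str accessor
value = df.loc[2, 'a']   # a single cell: the plain string 'foobar'

series_has_accessor = hasattr(column, 'str')
scalar_has_accessor = hasattr(value, 'str')
scalar_has_contains = hasattr(value, 'contains')
substring_found = 'foo' in value

print(series_has_accessor, scalar_has_accessor, scalar_has_contains, substring_found)
```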

Answer 1
1 accepted 2017-11-02 12:12:32

Answer 2
1 2017-11-02 12:29:58
