Python function partial string match
I have a pandas dataframe like this:
a b c
foo bar baz
bar foo baz
foobar barfoo baz
I've defined the following function in Python:
def somefunction(row):
    if row['a'] == 'foo' and row['b'] == 'bar':
        return 'yes'
    return 'no'
It works perfectly fine. But I need to make a small tweak to the if statement to take partial string matches into account.
I've tried several combinations, but I can't seem to get it to work. I get the following error:
("'str' object has no attribute 'str'", 'occurred at index 0')
The function I've tried is:
def somenewfunction(row):
    if row['a'].str.contains('foo')==True and row['b'] == 'bar':
        return 'yes'
    return 'no'
Use str.contains to build a boolean mask and then numpy.where:
m = df['a'].str.contains('foo') & (df['b'] == 'bar')
print (m)
0 True
1 False
2 False
dtype: bool
df['new'] = np.where(m, 'yes', 'no')
print (df)
a b c new
0 foo bar baz yes
1 bar foo baz no
2 foobar barfoo baz no
Or, if you also need to check column b for substrings:
m = df['a'].str.contains('foo') & df['b'].str.contains('bar')
df['new'] = np.where(m, 'yes', 'no')
print (df)
a b c new
0 foo bar baz yes
1 bar foo baz no
2 foobar barfoo baz yes
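As a self-contained sketch of the mask approach above: the example below rebuilds the sample frame and adds two optional arguments of str.contains that the answer does not use. The extra row with a missing value and the choice of regex=False and na=False are assumptions added for illustration, not part of the original question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["foo", "bar", "foobar", None],  # the None row is an added assumption
    "b": ["bar", "foo", "barfoo", "bar"],
    "c": ["baz", "baz", "baz", "baz"],
})

# regex=False treats 'foo' as a literal substring rather than a regular expression.
# na=False makes missing values count as "no match" so the mask stays boolean.
m = df["a"].str.contains("foo", regex=False, na=False) & (df["b"] == "bar")

df["new"] = np.where(m, "yes", "no")
print(df)
```

Without na=False, a missing value in column a would propagate into the mask, which can cause trouble when the mask is combined with & or passed to numpy.where.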
If you need a custom function, which should be slower on a bigger DataFrame:
def somefunction(row):
    if 'foo' in row['a'] and row['b'] == 'bar':
        return 'yes'
    return 'no'
print (df.apply(somefunction, axis=1))
0 yes
1 no
2 no
dtype: object
def somefunction(row):
    if 'foo' in row['a'] and 'bar' in row['b']:
        return 'yes'
    return 'no'
print (df.apply(somefunction, axis=1))
0 yes
1 no
2 yes
dtype: object
Timings:
df = pd.concat([df]*1000).reset_index(drop=True)
def somefunction(row):
    if 'foo' in row['a'] and row['b'] == 'bar':
        return 'yes'
    return 'no'
In [269]: %timeit df['new'] = df.apply(somefunction, axis=1)
10 loops, best of 3: 60.7 ms per loop
In [270]: %timeit df['new1'] = np.where(df['a'].str.contains('foo') & (df['b'] == 'bar'), 'yes', 'no')
100 loops, best of 3: 3.25 ms per loop
df = pd.concat([df]*10000).reset_index(drop=True)
def somefunction(row):
    if 'foo' in row['a'] and row['b'] == 'bar':
        return 'yes'
    return 'no'
In [272]: %timeit df['new'] = df.apply(somefunction, axis=1)
1 loop, best of 3: 614 ms per loop
In [273]: %timeit df['new1'] = np.where(df['a'].str.contains('foo') & (df['b'] == 'bar'), 'yes', 'no')
10 loops, best of 3: 23.5 ms per loop
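The %timeit lines above only work in IPython. A rough sketch of the same comparison with the standard-library timeit module might look like the following; the helper names with_apply and with_mask and the repeat count are hypothetical, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({
    "a": ["foo", "bar", "foobar"],
    "b": ["bar", "foo", "barfoo"],
    "c": ["baz", "baz", "baz"],
})
# Repeat the 3-row sample 1000 times, as in the first timing run.
df = pd.concat([base] * 1000).reset_index(drop=True)


def somefunction(row):
    if 'foo' in row['a'] and row['b'] == 'bar':
        return 'yes'
    return 'no'


def with_apply():
    # Row-wise Python function call for every row.
    return df.apply(somefunction, axis=1)


def with_mask():
    # Vectorized boolean mask fed to numpy.where.
    return np.where(df['a'].str.contains('foo') & (df['b'] == 'bar'), 'yes', 'no')


t_apply = timeit.timeit(with_apply, number=3)
t_mask = timeit.timeit(with_mask, number=3)
print(f"apply: {t_apply:.4f} s, mask: {t_mask:.4f} s (3 runs each)")
```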
Your exception probably comes from writing
if row['a'].str.contains('foo')==True
Inside a function applied row by row, row['a'] is a plain Python string, which has neither a .str accessor nor a .contains method. Use the in operator instead:
if 'foo' in row['a']
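To see why the .str accessor fails inside a row-wise function, here is a minimal sketch that inspects the type of row['a'] under df.apply(..., axis=1) and then applies a substring test with the in operator. The variable names row_types and result are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["foo", "bar", "foobar"],
    "b": ["bar", "foo", "barfoo"],
    "c": ["baz", "baz", "baz"],
})

# With axis=1 each row is passed as a Series, so row['a'] is a single
# Python str, not a Series; the .str accessor only exists on Series/Index.
row_types = df.apply(lambda row: type(row["a"]).__name__, axis=1)
print(row_types.unique())


def somenewfunction(row):
    # Plain Python substring test on the scalar value.
    if "foo" in row["a"] and row["b"] == "bar":
        return "yes"
    return "no"


result = df.apply(somenewfunction, axis=1)
print(result)
```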