Using str.contains in pandas with apply raises 'str' object has no attribute 'str' error
I thought it would be simple, but I've been stuck on this for a while now.
I have a column containing several pieces of information, and depending on its content I'd like to label each row with a category:
import pandas as pd
df = pd.DataFrame({"col1": ["A1 zwd fill text", "B2 rest uninteresting", "A1 more random text"]})
I assumed that this would do the trick:
df["col2"] = df.apply(lambda x: "Some Category" if x.col1.str.contains["A1"] else "Another Category", axis=1)
but it just raises AttributeError: 'str' object has no attribute 'str'. Is it impossible to use str.contains with apply?
Use numpy.where for a fast, vectorized solution; a task this simple doesn't need a custom apply:
import numpy as np

df['col2'] = np.where(df['col1'].str.contains('A1'), 'Some Category', 'Another Category')
# np.where(<condition>, <value if true>, <value if false>)
                    col1              col2
0       A1 zwd fill text     Some Category
1  B2 rest uninteresting  Another Category
2    A1 more random text     Some Category
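One caveat with the np.where approach: str.contains interprets its pattern as a regular expression by default, and missing values in the column may not come back as a clean True/False. A minimal sketch, assuming a hypothetical column with one missing entry (not part of the question's data), that passes na=False and regex=False to get a plain boolean mask for a literal substring:

```python
import numpy as np
import pandas as pd

# Hypothetical data: like the question's column, plus one missing value.
df = pd.DataFrame({"col1": ["A1 zwd fill text", None, "B2 rest uninteresting"]})

# na=False: treat missing values as "no match".
# regex=False: match "A1" as a literal substring, not as a regex pattern.
mask = df["col1"].str.contains("A1", na=False, regex=False)

df["col2"] = np.where(mask, "Some Category", "Another Category")
print(df)
```

With na=False the row holding the missing value ends up in 'Another Category' instead of depending on how a missing result is interpreted as a condition.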
Or purely in pandas, using Series.map:
df['col2'] = df['col1'].str.contains('A1').map({True: 'Some Category',
                                                False: 'Another Category'})
                    col1              col2
0       A1 zwd fill text     Some Category
1  B2 rest uninteresting  Another Category
2    A1 more random text     Some Category
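As for why the original attempt fails: with axis=1, apply passes each row to the lambda, so x.col1 is a single Python string rather than a Series, and only a Series has the .str accessor. If you do want apply, use the plain `in` operator on the string. A minimal sketch (label is a hypothetical helper name, not something from the question):

```python
import pandas as pd

df = pd.DataFrame({"col1": ["A1 zwd fill text", "B2 rest uninteresting", "A1 more random text"]})

def label(text):
    # `text` is a plain Python str here, so use `in` instead of .str.contains
    return "Some Category" if "A1" in text else "Another Category"

# Row-wise apply: x is one row, x.col1 is a str
by_row = df.apply(lambda x: label(x.col1), axis=1)

# Applying to the single column avoids building a row object per call
by_value = df["col1"].apply(label)

print(by_row.tolist())
print(by_value.tolist())
```

Both variants give the same labels as the vectorized versions; they just call a Python function once per row.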
Timings:
# create test dataframe of 900k rows
df = pd.DataFrame({"col1": ["A1 zwd fill text", "B2 rest uninteresting", "A1 more random text"]})
dfbig = pd.concat([df]*300000, ignore_index=True)
Solution 1: np.where
%%timeit
np.where(dfbig['col1'].str.contains('A1'), 'Some Category', 'Another Category')
855 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 2: Series.map
%%timeit
dfbig['col1'].str.contains('A1').map({True: 'Some Category',
                                      False: 'Another Category'})
920 ms ± 15.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 3: apply
%%timeit
dfbig.apply(lambda x: "Some Category" if "A1" in x.col1 else "Another Category", axis=1)
28.5 s ± 446 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Conclusion: with the timings above, np.where is ~33x faster than apply (28.5 s vs. 855 ms).
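%%timeit is an IPython/Jupyter magic, so it won't work in a plain script. To reproduce the comparison outside a notebook, here is a rough sketch using the standard timeit module. The frame size (3,000 rows), the repeat count and the names dfsmall, with_where and with_apply are assumptions chosen so it finishes quickly; absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["A1 zwd fill text", "B2 rest uninteresting", "A1 more random text"]})
# Assumption: 3,000 rows keeps the run short; increase the multiplier to scale up.
dfsmall = pd.concat([df] * 1000, ignore_index=True)

def with_where():
    # Vectorized: one str.contains call over the whole column
    return np.where(dfsmall["col1"].str.contains("A1"), "Some Category", "Another Category")

def with_apply():
    # Row-wise: one Python-level lambda call per row
    return dfsmall.apply(lambda x: "Some Category" if "A1" in x.col1 else "Another Category", axis=1)

# Total time for 5 calls of each function
t_where = timeit.timeit(with_where, number=5)
t_apply = timeit.timeit(with_apply, number=5)
print(f"np.where: {t_where:.4f} s, apply: {t_apply:.4f} s, ratio: {t_apply / t_where:.1f}x")
```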