
Use str.contains in pandas with apply statement raises str object has not attribute str error

I thought it would be simple, but I've been stumbling over this for a while now.

I have a column containing several pieces of information, and depending on its content I'd like to label it with a category:

import pandas as pd
df = pd.DataFrame({"col1": ["A1 zwd fill text", "B2 rest uninteresting", "A1 more random text"]})

I assumed that this would do the trick: df["col2"] = df.apply(lambda x: "Some Category" if x.col1.str.contains["A1"] else "Another Category", axis=1)

but it just raises a 'str' object has no attribute 'str' error. Is it impossible to use str.contains with apply?
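For reference (a minimal sketch, not part of the original question): with apply(..., axis=1), each x.col1 is a plain Python str rather than a Series, so it has no .str accessor. Testing substring membership with the in operator works instead:

import pandas as pd

df = pd.DataFrame({"col1": ["A1 zwd fill text", "B2 rest uninteresting", "A1 more random text"]})

# x.col1 is a plain str inside the row-wise apply, so use `in` rather than .str.contains
df["col2"] = df.apply(lambda x: "Some Category" if "A1" in x.col1 else "Another Category", axis=1)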

Use numpy.where for an optimal vectorized solution; we don't need a custom apply for such trivial operations:

import numpy as np

df['col2'] = np.where(df['col1'].str.contains('A1'), 'Some Category', 'Another Category')
           # np.where(<condition>, <value if true>, <value if false>)
                    col1              col2
0       A1 zwd fill text     Some Category
1  B2 rest uninteresting  Another Category
2    A1 more random text     Some Category
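One caveat worth noting (an addition, not from the original answer): str.contains interprets the pattern as a regular expression by default, so for a literal match, or when the search string contains regex metacharacters, pass regex=False:

df['col2'] = np.where(df['col1'].str.contains('A1', regex=False),
                      'Some Category', 'Another Category')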

Or, purely in pandas, using Series.map:

df['col2'] = df['col1'].str.contains('A1').map({True: 'Some Category', 
                                                False: 'Another Category'})

                    col1              col2
0       A1 zwd fill text     Some Category
1  B2 rest uninteresting  Another Category
2    A1 more random text     Some Category
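Side note (not in the original answer): if col1 can contain missing values, str.contains returns NaN for them and the map then leaves those rows as NaN; passing na=False treats missing values as non-matches:

df['col2'] = df['col1'].str.contains('A1', na=False).map({True: 'Some Category',
                                                          False: 'Another Category'})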

Timings:

# create test dataframe of 900k rows
df = pd.DataFrame({"col1": ["A1 zwd fill text", "B2 rest uninteresting", "A1 more random text"]})
dfbig = pd.concat([df]*300000, ignore_index=True)

Solution 1: np.where

%%timeit
np.where(dfbig['col1'].str.contains('A1'), 'Some Category', 'Another Category')

855 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution 2: Series.map

%%timeit
dfbig['col1'].str.contains('A1').map({True: 'Some Category',
                                      False: 'Another Category'})

920 ms ± 15.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution 3: apply

%%timeit
dfbig.apply(lambda x: "Some Category" if "A1" in x.col1 else "Another Category", axis=1)

28.5 s ± 446 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Conclusion: numpy is ~33x faster than apply (855 ms vs. 28.5 s).
