
Use str.contains in pandas with apply statement raises str object has not attribute str error

I thought it would be simple, but I've been stumbling over this for a while now.

I have a column containing several pieces of information, and depending on its content I'd like to label it with a category:

import pandas as pd
df = pd.DataFrame({"col1": ["A1 zwd fill text", "B2 rest uninteresting", "A1 more random text"]})

I assumed that this would do the trick: df["col2"] = df.apply(lambda x: "Some Category" if x.col1.str.contains["A1"] else "Another Category", axis=1)

but it just raises a 'str' object has no attribute 'str' error. Is it impossible to use str.contains with apply?
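For reference (a minimal sketch, not part of the original question): with apply(..., axis=1), each x.col1 is a plain Python str rather than a Series, so it has no .str accessor. Testing substring membership with the in operator works instead:

import pandas as pd

df = pd.DataFrame({"col1": ["A1 zwd fill text", "B2 rest uninteresting", "A1 more random text"]})

# x.col1 is a plain str inside the row-wise apply, so use `in` rather than .str.contains
df["col2"] = df.apply(lambda x: "Some Category" if "A1" in x.col1 else "Another Category", axis=1)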

Use numpy.where for an optimal vectorized solution; we don't need a custom apply for such trivial operations:

import numpy as np

df['col2'] = np.where(df['col1'].str.contains('A1'), 'Some Category', 'Another Category')
           # np.where(<condition>, <value if true>, <value if false>)
                    col1              col2
0       A1 zwd fill text     Some Category
1  B2 rest uninteresting  Another Category
2    A1 more random text     Some Category
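One caveat worth noting (an addition, not from the original answer): str.contains interprets the pattern as a regular expression by default, so for a literal match, or when the search string contains regex metacharacters, pass regex=False:

df['col2'] = np.where(df['col1'].str.contains('A1', regex=False),
                      'Some Category', 'Another Category')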

Or, purely in pandas, using Series.map:

df['col2'] = df['col1'].str.contains('A1').map({True: 'Some Category', 
                                                False: 'Another Category'})

                    col1              col2
0       A1 zwd fill text     Some Category
1  B2 rest uninteresting  Another Category
2    A1 more random text     Some Category
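Side note (not in the original answer): if col1 can contain missing values, str.contains returns NaN for them and the map then leaves those rows as NaN; passing na=False treats missing values as non-matches:

df['col2'] = df['col1'].str.contains('A1', na=False).map({True: 'Some Category',
                                                          False: 'Another Category'})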

Timings:

# create test dataframe of 900k rows
df = pd.DataFrame({"col1": ["A1 zwd fill text", "B2 rest uninteresting", "A1 more random text"]})
dfbig = pd.concat([df]*300000, ignore_index=True)

Solution 1: np.where

%%timeit
np.where(dfbig['col1'].str.contains('A1'), 'Some Category', 'Another Category')

855 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution 2: Series.map

%%timeit
dfbig['col1'].str.contains('A1').map({True: 'Some Category',
                                      False: 'Another Category'})

920 ms ± 15.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution 3: apply

%%timeit
dfbig.apply(lambda x: "Some Category" if "A1" in x.col1 else "Another Category", axis=1)

28.5 s ± 446 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Conclusion: numpy is ~33x faster than apply (855 ms vs. 28.5 s).
