object has no attribute when removing stop words with NLTK

I am trying to remove the words in NLTK's stopwords collection from a pandas DataFrame consisting of rows of text data, in Python 3:

import pandas as pd
from nltk.corpus import stopwords

file_path = '/users/rashid/desktop/webtext.csv'
doc = pd.read_csv(file_path, encoding = "ISO-8859-1")
texts = doc['text']
filter = texts != ""
dfNew = texts[filter]

stop = stopwords.words('english')
dfNew.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

I am getting this error:

'float' object has no attribute 'split'

Sounds like you have some numbers in your texts, and they are causing pandas to get a little too smart. Add the dtype option to pandas.read_csv() to ensure that everything in the column text is imported as a string:

doc = pd.read_csv(file_path, encoding = "ISO-8859-1", dtype={'text':str})
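To see what the dtype option changes, here is a minimal sketch with a hypothetical in-memory CSV (standing in for webtext.csv) whose text column happens to contain only digits; without dtype, pandas would parse these values as numbers, and .split() would then fail on them:

```python
import io
import pandas as pd

# Hypothetical sample data: every value in 'text' looks numeric,
# so read_csv would normally infer an integer column.
csv_data = "text\n101\n202\n"

# dtype={'text': str} forces the column to be read as strings.
doc = pd.read_csv(io.StringIO(csv_data), dtype={'text': str})
print(doc['text'].tolist())  # ['101', '202'] -- strings, not ints
```

With string values guaranteed, x.split() in the apply() call no longer raises AttributeError.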

Once you get your code working, you might notice it is slow: looking things up in a list is inefficient. Put your stopwords in a set like this, and you'll be amazed at the speedup. (The in operator works with both sets and lists, but the difference in speed is huge.)

stop = set(stopwords.words('english'))
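A minimal sketch of the point above, using a small hand-written stand-in for stopwords.words('english') so it runs without NLTK: the filtering expression is unchanged, only the container type differs, and set membership is O(1) per word instead of O(n).

```python
# Stand-in stopword list (assumption: a few common English stopwords).
stop_list = ['the', 'is', 'a', 'of']
stop_set = set(stop_list)  # same contents, constant-time lookup

sentence = "the cat is on a mat"

# Identical filtering logic works with either container.
kept = ' '.join(w for w in sentence.split() if w not in stop_set)
print(kept)  # cat on mat
```

For a 179-word stopword list applied to every word of every row, that per-lookup difference adds up quickly.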

Finally, change x.split() to nltk.word_tokenize(x) (and add import nltk). If your data contains real text, this will separate punctuation from words and allow you to match stopwords properly.
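The sketch below illustrates why whitespace splitting misses stopwords in real text. It uses a regex as a rough, dependency-free stand-in for nltk.word_tokenize (which additionally requires downloading the punkt tokenizer model via nltk.download('punkt')); the real tokenizer handles contractions and other cases the regex does not.

```python
import re

sentence = "The cat, naturally, slept."

# Whitespace splitting leaves punctuation glued to words ('cat,', 'slept.'),
# so those tokens can never match entries in a stopword set.
print(sentence.split())

# Rough stand-in for nltk.word_tokenize: words and punctuation as
# separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)  # ['The', 'cat', ',', 'naturally', ',', 'slept', '.']
```

Note that NLTK's stopword list is lowercase, so lowercasing tokens before the membership test (for example, word.lower() not in stop) is also needed to catch sentence-initial words like "The".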
