Object has no attribute when removing stop words with NLTK

Question

I am trying to remove the stop words in NLTK's stop word collection from a pandas DataFrame that consists of rows of text data, in Python 3:

import pandas as pd
from nltk.corpus import stopwords

file_path = '/users/rashid/desktop/webtext.csv'
doc = pd.read_csv(file_path, encoding = "ISO-8859-1")
texts = doc['text']
filter = texts != ""
dfNew = texts[filter]

stop = stopwords.words('english')
dfNew.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

I get this error:

'float' object has no attribute 'split'

Answer 1

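It sounds like you have some numbers in your text, and they are causing pandas to get a little too smart. Add the dtype option to pandas.read_csv() to ensure that everything in the column text is imported as a string: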

doc = pd.read_csv(file_path, encoding = "ISO-8859-1", dtype={'text':str})
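One caveat worth checking: even with dtype set to str, pandas reads an empty cell as NaN, which is a float, so the same error can still appear. The filter texts != "" in the question would not catch those rows, because NaN != "" evaluates to True. Here is a minimal sketch of that behaviour and one way around it with dropna(). The CSV contents and the id column are made up for illustration and are not from the original file.

```python
import io

import pandas as pd

# Hypothetical CSV data: the second row has an empty "text" cell,
# and the third row holds something that looks like a number.
csv_data = io.StringIO("id,text\n1,hello world\n2,\n3,42\n")

doc = pd.read_csv(csv_data, dtype={"text": str})

# "42" is now the string "42", but the empty cell is still NaN (a float).
print(doc["text"].tolist())

# Drop missing rows before calling string methods such as split().
texts = doc["text"].dropna()
tokens = texts.apply(lambda x: x.split())
print(tokens.tolist())
```

If you would rather keep those rows, passing keep_default_na=False to read_csv() should give you empty strings instead of NaN, and your texts != "" filter would then work as intended.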

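Once your code is working, you might notice that it is slow: looking things up in a list is inefficient. Put your stop words in a set like this, and you'll be amazed at the speedup. (The in operator works with both sets and lists, but there is a huge difference in speed.)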

stop = set(stopwords.words('english'))
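If you want to see the difference for yourself, here is a rough timing sketch. To keep it runnable without downloading NLTK data, it uses a made-up stop word list of roughly the same size as the English one. The words, the filler entries, and the filter_words helper are all hypothetical, and the actual timings will vary by machine.

```python
import timeit

# Hypothetical stand-in for stopwords.words('english'): a few real stop
# words plus filler entries to approximate the size of the NLTK list.
stop_list = ["the", "a", "an", "and", "of", "to", "in", "is", "it", "that"]
stop_list += [f"filler{i}" for i in range(170)]
stop_set = set(stop_list)

# Some sample text, repeated to make the timing measurable.
words = "the quick brown fox jumps over the lazy dog and runs to the river".split() * 1000


def filter_words(words, stop):
    """Keep only the words that are not in the stop collection."""
    return [w for w in words if w not in stop]


# Membership tests scan a list item by item, but use a hash lookup on a set.
list_time = timeit.timeit(lambda: filter_words(words, stop_list), number=5)
set_time = timeit.timeit(lambda: filter_words(words, stop_set), number=5)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

Both versions return the same filtered words; only the lookup cost changes.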

Finally, change x.split() to nltk.word_tokenize(x). If your data contains real text, this will separate punctuation from words and allow you to match stop words properly.
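To illustrate the punctuation problem, here is a small sketch. Note that simple_tokenize below is a regex stand-in, not nltk.word_tokenize itself (which needs the NLTK tokenizer data downloaded and handles many more cases). The sample sentence and the tiny stop word set are also made up.

```python
import re

# Hypothetical tiny stop word set and sample sentence.
stop = {"the", "is", "it", "a"}
text = "it is the answer, is it."


def simple_tokenize(s):
    """Regex stand-in for nltk.word_tokenize: split off punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", s)


# With split(), "it." keeps its trailing period, so it does not match "it".
split_result = [w for w in text.split() if w not in stop]
print(split_result)  # ['answer,', 'it.']

# With punctuation separated, every stop word is matched and removed.
token_result = [w for w in simple_tokenize(text) if w not in stop]
print(token_result)  # ['answer', ',', '.']
```

The punctuation tokens remain in the output, so you may want to filter those out as well, depending on what you do with the text afterwards.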
