LookupError while removing stop words from a list of column in pandas
I have a dataset with 1 million records, as shown below.
Sample DF1:
articles_urlToImage  feed_status  status  keyword
hhtps://rqqkf.com    untagged     tag     the apple,a mobile phone
hhtps://hqkf.com     tagged       ingore  blackberry, the a phone
hhtps://hqkf.com     untagged     tag     amazon, an shopping site
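Now I want to remove the standard stop words plus some custom stop words, as shown below.
Custom stop words = ['phone','site'] (I have about 35 custom stop words)
Expected output: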
articles_urlToImage  feed_status  status  keyword
hhtps://rqqkf.com    untagged     tag     apple,mobile
hhtps://hqkf.com     tagged       ingore  blackberry
hhtps://hqkf.com     untagged     tag     amazon,shopping
I tried to remove the stop words, but I get the following error.
Code:
import nltk
import string
from nltk.corpus import stopwords
stop = stopwords.words('english')
df1['keyword'] = df1['keyword'].apply(lambda x: [item for item in x if item not in stop])
Error:
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in __getattr__(self, name)
3612 if name in self._info_axis:
3613 return self[name]
-> 3614 return object.__getattribute__(self, name)
3615
3616 def __setattr__(self, name, value):
AttributeError: 'Series' object has no attribute 'split'
You can use:
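from nltk.corpus import stopwords

# stopwords.words() raises LookupError if the corpus has not been downloaded;
# in that case run nltk.download('stopwords') once beforehand.
stop = stopwords.words('english')
custom = ['phone', 'site']
# Join the lists together; a set keeps the membership test fast on 1M rows
stop = set(custom + stop)
# Replace punctuation with spaces, split on whitespace and drop stop words.
# regex=True is needed on pandas 2.0+, where str.replace matches literally by default.
df1['keyword'] = (df1['keyword'].str.replace(r'[^\w\s]+', ' ', regex=True)
                  .apply(lambda x: [item for item in x.split() if item not in stop]))
print(df1)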
  articles_urlToImage feed_status  status             keyword
0   hhtps://rqqkf.com    untagged     tag     [apple, mobile]
1    hhtps://hqkf.com      tagged  ingore        [blackberry]
2    hhtps://hqkf.com    untagged     tag  [amazon, shopping]
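
The result above holds lists, while the expected output in the question shows comma-separated strings. If strings are what you need, the per-cell logic could be sketched as a plain function like the one below. This is only an illustration: `remove_stop_words` and `STOP_WORDS` are hypothetical names, and the small hard-coded set stands in for `stopwords.words('english')` plus the custom words so that the sketch runs without the NLTK data.

```python
import re

# Hypothetical stand-in for stopwords.words('english') plus the custom words,
# kept tiny so the sketch runs without NLTK being installed.
STOP_WORDS = {"the", "a", "an", "phone", "site"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the keywords in `text` minus stop words, joined by commas."""
    # Replace punctuation with spaces, then split on whitespace.
    tokens = re.sub(r"[^\w\s]+", " ", text).split()
    # Lowercase only for the comparison, so the original casing is kept.
    kept = [tok for tok in tokens if tok.lower() not in stop_words]
    return ",".join(kept)


print(remove_stop_words("the apple,a mobile phone"))  # apple,mobile
```

On the DataFrame this would be applied as `df1['keyword'] = df1['keyword'].apply(remove_stop_words)`. Using a set for the stop words rather than a list keeps each lookup cheap, which may matter with 1 million rows.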