Removing stop words that are not in the NLTK library in Python
I've been trying to remove stop words from a csv file, including some that are not in the NLTK stop-word list, but when I generate the new data frame with the additional, supposedly "cleaned" column, I still see some of those words, and I don't know how to remove them. I'm not sure what's wrong with my code, but here it is:
import re

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = stopwords.words('english')
print(len(stop_words))
stop_words.extend(["consist", "feature", "site", "mound", "medium", "density", "enclosure"])

lemma = WordNetLemmatizer()

def clean_review(review_text):
    # review_text = re.sub(r'http\S+', '', review_text)
    review_text = re.sub('[^a-zA-Z]', ' ', str(review_text))
    review_text = str(review_text).lower()
    review_text = word_tokenize(review_text)
    review_text = [word for word in review_text if word not in stop_words]
    # review_text = [stemmer.stem(i) for i in review_text]
    review_text = [lemma.lemmatize(word=w, pos='v') for w in review_text]
    review_text = [i for i in review_text if len(i) > 2]
    review_text = ' '.join(review_text)
    return review_text

filename['New_Column'] = filename['Column'].apply(clean_review)
You are lemmatizing the text after removing the stop words, which is sometimes fine. However, you may have words that only match your stop-word list after lemmatization.
See this example:
>>> import nltk
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> print(lemmatizer.lemmatize("sites"))
site
>>>
At first, your script does not remove sites, because the token is not in the stop-word list; but after lemmatization it becomes site, which is in your extended list, so it should be removed.