Remove Stop Words Python
I am reading a CSV file and extracting the words from it, and I am trying to remove the stop words. Here is my code.
import pandas as pd
from nltk.corpus import stopwords as sw

def loadCsv(fileName):
    df = pd.read_csv(fileName, error_bad_lines=False)
    df.dropna(inplace=True)
    return df

def getWords(dataframe):
    words = []
    for tweet in dataframe['SentimentText'].tolist():
        for word in tweet.split():
            word = word.lower()
            words.append(word)
    return set(words)  # Create a set from the words list

def removeStopWords(words):
    for word in words:  # iterate over word_list
        if word in sw.words('english'):
            words.remove(word)  # remove word from filtered_word_list if it is a stopword
    return set(words)

df = loadCsv("train.csv")
words = getWords(df)
words = removeStopWords(words)
On this line:
if word in sw.words('english'):
I get the following error:

Exception: No description

Further on, I will also be trying to remove punctuation, so any pointers on that would be great as well. Any help is much appreciated.
EDIT
def removeStopWords(words):
    filtered_word_list = words  # make a copy of the words
    for word in words:  # iterate over words
        if word in sw.words('english'):
            filtered_word_list.remove(word)  # remove word from filtered_word_list if it is a stopword
    return set(filtered_word_list)
Change the removeStopWords function to the following:
def getFilteredStopWords(words):
    list_stopWords = list(set(sw.words('english')))
    filtered_words = [w for w in words if w not in list_stopWords]  # keep a word only if it is not a stopword
    return filtered_words
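A quick usage sketch, reusing the loadCsv and getWords helpers from the question (the variable name filtered_words here is just illustrative):

df = loadCsv("train.csv")
words = getWords(df)
filtered_words = getFilteredStopWords(words)
print(len(filtered_words))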
Here is a simplified version of the problem, without Pandas. I believe the issue in the original code is that the set words is modified while it is being iterated over. By using a conditional list comprehension we can test each word, build a new list, and finally turn it into a set, just as the original code did.
from nltk.corpus import stopwords as sw

def removeStopWords(words):
    return set([w for w in words if w not in sw.words('english')])

sentence = 'this is a very common english sentence with a finite set of words from my imagination'
words = set(sentence.split())

print(removeStopWords(words))
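The question also asks for pointers on removing punctuation. A minimal sketch using the standard library's string.punctuation together with str.translate; the helper name and the way it combines punctuation stripping with the stopword filter are illustrative, not part of the original answers:

import string
from nltk.corpus import stopwords as sw

def removePunctuationAndStopWords(sentence):
    # strip punctuation characters, lowercase, then drop NLTK English stopwords
    table = str.maketrans('', '', string.punctuation)
    words = sentence.translate(table).lower().split()
    return set(w for w in words if w not in sw.words('english'))

print(removePunctuationAndStopWords("Hello, world! This is a test sentence."))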
from nltk.corpus import stopwords

def remmove_stopwords(sentence):
    list_stop_words = set(stopwords.words('english'))
    words = sentence.split(' ')
    filtered_words = [w for w in words if w not in list_stop_words]
    sentence_list = ' '.join(w for w in filtered_words)
    return sentence_list
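For example, calling it on the sample sentence from the earlier answer (the commented output is roughly what the standard NLTK English stopword list should produce):

sentence = 'this is a very common english sentence with a finite set of words from my imagination'
print(remmove_stopwords(sentence))
# roughly: 'common english sentence finite set words imagination'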