How to remove stop words using nltk or python
I have a dataset from which I want to remove stop words.
I used NLTK to get a list of stop words:
from nltk.corpus import stopwords
stopwords.words('english')
How exactly do I compare my data against the list of stop words, and thus remove the stop words from the data?
from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
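One caveat: `stopwords.words('english')` returns a list, and the comprehension above re-scans it for every word. Converting the list to a `set` once makes each membership test O(1). A minimal sketch of the same pattern, using a small hardcoded list as a stand-in for the NLTK stop words (which require a corpus download):

```python
# Membership tests against a list are O(n); build a set once and reuse it.
# This tiny hardcoded set stands in for set(stopwords.words('english')).
stop_words = set(["i", "me", "the", "a", "is", "in"])

word_list = ["the", "cat", "is", "in", "the", "garden"]
filtered_words = [word for word in word_list if word not in stop_words]
print(filtered_words)  # ['cat', 'garden']
```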
You could also do a set difference, for example:
list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
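Note that the set difference discards word order and duplicate occurrences, so it is only appropriate when a bag of unique words is enough. A small self-contained demonstration (with a hardcoded stop set instead of the tokenizer and NLTK list):

```python
# Set difference removes stop words, but also loses order and duplicates.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
stop = {"the", "on"}
remaining = set(tokens) - stop
print(sorted(remaining))  # ['cat', 'mat', 'sat']
```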
To exclude all types of stop words, including the NLTK stop words, you could do something like this:
from stop_words import get_stop_words
from nltk.corpus import stopwords
stop_words = list(get_stop_words('en'))        # about 900 stop words
nltk_words = list(stopwords.words('english'))  # about 150 stop words
stop_words.extend(nltk_words)
output = [w for w in word_list if w not in stop_words]
I suppose you have a list of words (word_list) from which you want to remove stop words. You could do something like this:
filtered_word_list = word_list[:]  # make a copy of word_list
for word in word_list:  # iterate over word_list
    if word in stopwords.words('english'):
        filtered_word_list.remove(word)  # remove word from filtered_word_list if it is a stopword
For this there is a very simple, lightweight Python package called stop-words.
First install the package with: pip install stop-words
Then you can remove your words in one line using a list comprehension:
from stop_words import get_stop_words
filtered_words = [word for word in dataset if word not in get_stop_words('english')]
This package is very lightweight to download (unlike nltk), works for both Python 2 and Python 3, and it has stop words for many other languages, such as:
Arabic
Bulgarian
Catalan
Czech
Danish
Dutch
English
Finnish
French
German
Hungarian
Indonesian
Italian
Norwegian
Polish
Portuguese
Romanian
Russian
Spanish
Swedish
Turkish
Ukrainian
Use the textcleaner library to remove stop words from your data.
Follow this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds
Follow these steps to do it with this library.
pip install textcleaner
After installing:
import textcleaner as tc
data = tc.document(<file_name>)
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default
Use the code above to remove the stop words.
Here is my take on this, in case you want to get the answer back as a string immediately (rather than a list of filtered words):
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
text = ' '.join([word for word in text.split() if word not in STOPWORDS])  # delete stop words from text
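The same split-filter-join pattern end to end, with a tiny hardcoded stop set standing in for the NLTK list (which needs a corpus download):

```python
# Stand-in for set(stopwords.words('english')).
STOPWORDS = set(["this", "is", "a", "the", "of"])

text = "this is a sample sentence showing removal of the stop words"
text = ' '.join(word for word in text.split() if word not in STOPWORDS)
print(text)  # "sample sentence showing removal stop words"
```

Note that `str.split()` does not strip punctuation, so "words." and "words" are different tokens; for real text you would tokenize first.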
You can use this function; note that you need to lowercase all the words:
from nltk.corpus import stopwords

def remove_stopwords(word_list):
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lowercase
        if word not in stopwords.words("english"):
            processed_word_list.append(word)
    return processed_word_list
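The lowercasing step matters because the NLTK list is all lowercase, so a capitalized "The" would otherwise slip through. A self-contained sketch of the same function with a small hardcoded set standing in for `stopwords.words("english")`:

```python
STOP_WORDS = {"the", "a", "is"}  # stand-in for the NLTK English stop word list

def remove_stopwords(word_list):
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # "The" and "the" should both be caught
        if word not in STOP_WORDS:
            processed_word_list.append(word)
    return processed_word_list

print(remove_stopwords(["The", "Cat", "is", "a", "Hunter"]))  # ['cat', 'hunter']
```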
Using filter:
from nltk.corpus import stopwords
# ...
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))
Although the question is a bit old, a new library worth mentioning here can do extra tasks.
In some cases, you don't want only to remove stop words. Rather, you may want to find the stop words in the text data and store them in a list, so that you can locate the noise in the data and make it more interactive.
The library is called 'textfeatures'. You can use it as follows:
!pip install textfeatures
import textfeatures as tf
import pandas as pd
For example, suppose you have the following set of strings:
texts = [
"blue car and blue window",
"black crow in the window",
"i see my reflection in the window"]
df = pd.DataFrame(texts) # Convert to a dataframe
df.columns = ['text'] # give a name to the column
df
Now, call the stopwords() function and pass the arguments you want:
tf.stopwords(df,"text","stopwords") # extract stop words
df[["text","stopwords"]].head() # inspect the new stopwords column
The result will be:
text stopwords
0 blue car and blue window [and]
1 black crow in the window [in, the]
2 i see my reflection in the window [i, my, in, the]
As you can see, the last column contains the stop words found in that document (record).
If your data is stored as a Pandas DataFrame, you can use remove_stopwords from texthero, which uses the NLTK stop word list by default:
import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in word_tokens if w not in stop_words]

# Equivalently, with an explicit loop:
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
I will show you an example. First I extract the text data from the data frame (twitter_df) for further processing, as follows:
from nltk.tokenize import word_tokenize
tweetText = twitter_df['text']
Then I tokenize it using the following method:
from nltk.tokenize import word_tokenize
tweetText = tweetText.apply(word_tokenize)
Then, to remove the stop words:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
tweetText = tweetText.apply(lambda x:[word for word in x if word not in stop_words])
tweetText.head()
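The apply call above just runs the same list comprehension over every row of the Series. Without pandas, the equivalent plain-Python version looks like this (the tweet tokens and the small stop set are made up for illustration):

```python
stop_words = {"the", "is", "a", "in"}  # stand-in for set(stopwords.words('english'))

tweet_tokens = [
    ["the", "weather", "is", "nice"],
    ["a", "bird", "in", "the", "garden"],
]
# Filter each tokenized tweet, keeping per-tweet structure intact.
cleaned = [[word for word in tokens if word not in stop_words] for tokens in tweet_tokens]
print(cleaned)  # [['weather', 'nice'], ['bird', 'garden']]
```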
I think this will help you:
print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]  # named stop_list to avoid shadowing the built-in list
another_list = []
for x in userstring:
    if x not in stop_list:  # keep only words that are not in the stop list
        another_list.append(x)
for x in another_list:
    print(x, end=' ')
# 2) if you prefer to use .remove
print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]
for x in userstring[:]:  # iterate over a copy; removing from a list while iterating over it skips elements
    if x in stop_list:
        userstring.remove(x)
for x in userstring:
    print(x, end=' ')
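Removing items from a list while iterating over that same list skips elements, which is why consecutive stop words can survive; iterating over a copy avoids this. A minimal demonstration:

```python
words = ["the", "a", "cat"]

# Buggy: after each .remove(), the loop index effectively skips the next element.
buggy = words[:]
for x in buggy:
    if x in ("the", "a"):
        buggy.remove(x)
print(buggy)  # ['a', 'cat'] -- "a" was skipped

# Correct: iterate over a copy, mutate the original.
fixed = words[:]
for x in fixed[:]:
    if x in ("the", "a"):
        fixed.remove(x)
print(fixed)  # ['cat']
```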