
How to remove stop words using nltk or python

I have a dataset from which I would like to remove stop words.

I am getting a list of stop words using NLTK:

from nltk.corpus import stopwords

stopwords.words('english')

How exactly do I compare the data to the list of stop words, and thus remove the stop words from the data?

from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
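For instance, a minimal runnable sketch (word_list is a hypothetical tokenized input, and the stopwords corpus must be downloaded once):

import nltk
nltk.download('stopwords')  # one-time download of the stop word lists
from nltk.corpus import stopwords

word_list = ['this', 'is', 'a', 'sample', 'sentence']  # hypothetical input
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
print(filtered_words)  # ['sample', 'sentence']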

You could also do a set difference, for example:

list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
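Since sentence and pattern are not defined above, here is one way the line might be used; the regex pattern is just an illustrative assumption. Note that converting to a set drops duplicates and word order:

import nltk
nltk.download('stopwords')  # ensure the corpus is available

sentence = "this is a sample sentence with stop words"
pattern = r'\s+'  # assumed pattern: with gaps=True, matches act as separators, so this splits on whitespace
tokens = list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
print(tokens)  # e.g. ['sample', 'words', 'stop', 'sentence'] (set order is arbitrary)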

To exclude all types of stop words, including the nltk stop words, you could do something like this:

from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = list(get_stop_words('en'))         #About 900 stopwords
nltk_words = list(stopwords.words('english')) #About 150 stopwords
stop_words.extend(nltk_words)

output = [w for w in word_list if w not in stop_words]

I suppose you have a list of words (word_list) from which you want to remove stop words. You could do something like this:

filtered_word_list = word_list[:] #make a copy of the word_list
for word in word_list: # iterate over word_list
  if word in stopwords.words('english'): 
    filtered_word_list.remove(word) # remove word from filtered_word_list if it is a stopword
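Note that this loop rebuilds the stop word list on every iteration, because stopwords.words('english') is called once per word. A variant of the same idea (a sketch, not the answerer's original code) precomputes the lookup as a set:

from nltk.corpus import stopwords

stop_set = set(stopwords.words('english'))  # build the lookup set once
filtered_word_list = [word for word in word_list if word not in stop_set]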

For this, there is a very simple, lightweight python package: stop-words.

First install the package with: pip install stop-words

Then you can remove your words in one line using a list comprehension:

from stop_words import get_stop_words

filtered_words = [word for word in dataset if word not in get_stop_words('english')]
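For example, with a hypothetical tokenized dataset:

from stop_words import get_stop_words

dataset = ['the', 'quick', 'brown', 'fox', 'is', 'in', 'the', 'yard']  # hypothetical input
filtered_words = [word for word in dataset if word not in get_stop_words('english')]
print(filtered_words)  # ['quick', 'brown', 'fox', 'yard']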

This package is very lightweight to download (unlike nltk), works for both Python 2 and Python 3, and it has stop words for many other languages, such as the following (a short example follows the list):

    Arabic
    Bulgarian
    Catalan
    Czech
    Danish
    Dutch
    English
    Finnish
    French
    German
    Hungarian
    Indonesian
    Italian
    Norwegian
    Polish
    Portuguese
    Romanian
    Russian
    Spanish
    Swedish
    Turkish
    Ukrainian
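Fetching stop words for any of the languages listed above is the same one-liner; a small sketch ('german' is just an example picked from the list):

from stop_words import get_stop_words

german_stop_words = get_stop_words('german')  # the language code 'de' also works
print(german_stop_words[:5])  # first few entries of the German list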

Use the textcleaner library to remove stop words from your data.

Follow this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds

Follow these steps to do this with the library.

pip install textcleaner

After installation:

import textcleaner as tc
data = tc.document(<file_name>) 
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default

Use the above code to remove the stop words.
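The comment above mentions that the document constructor also accepts a list of sentences, so a file is not required. A minimal sketch of that variant, assuming the textcleaner API shown above (the sample sentences are hypothetical):

import textcleaner as tc

sentences = ["this is a sample sentence", "another sentence with stop words"]  # hypothetical input
data = tc.document(sentences)    # pass a list of sentences instead of a file name
cleaned = data.remove_stpwrds()  # inplace is False by default, so the result is returned
print(cleaned)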

Here's my take on this, in case you want to immediately get the answer back as a string (rather than a list of filtered words):

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
text = ' '.join([word for word in text.split() if word not in STOPWORDS])  # delete stop words from text
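A quick demo of the one-liner with a concrete sentence (note the comparison is case-sensitive and punctuation stays attached to its word):

text = "this is a sample sentence showing off the stop words filtration"
text = ' '.join([word for word in text.split() if word not in STOPWORDS])
print(text)  # sample sentence showing stop words filtration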

You can use this function; note that you need to lowercase all of the words first:

from nltk.corpus import stopwords

def remove_stopwords(word_list):
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lower-cased
        if word not in stopwords.words("english"):
            processed_word_list.append(word)
    return processed_word_list
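A quick usage example for the function above:

words = ['This', 'is', 'a', 'Sample', 'sentence']
print(remove_stopwords(words))  # ['sample', 'sentence'] (everything is lower-cased first)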

Using filter:

from nltk.corpus import stopwords
# ...  
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))

Although the question is a bit old, a new library is worth mentioning here, since it can do an extra task.

In some cases, you don't want only to remove stop words. Rather, you may want to find the stop words in the text data and store them in a list, so that you can locate the noise in the data and make it more interactive.

The library is called 'textfeatures'. You can use it as follows:

! pip install textfeatures
import textfeatures as tf
import pandas as pd

For example, suppose you have the following set of strings:

texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window"]

df = pd.DataFrame(texts) # Convert to a dataframe
df.columns = ['text'] # give a name to the column
df

Now, call the stopwords() function and pass the parameters you want:

tf.stopwords(df,"text","stopwords") # extract stop words
df[["text","stopwords"]].head() # show the text alongside its extracted stop words

The result will be:

    text                                 stopwords
0   blue car and blue window             [and]
1   black crow in the window             [in, the]
2   i see my reflection in the window    [i, my, in, the]

As you can see, the last column contains the stop words found in that document (record).

If your data is stored as a Pandas DataFrame, you can use remove_stopwords from texthero, which uses the NLTK stop word list by default.

import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])
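A minimal sketch of the whole flow with a toy DataFrame (assuming texthero is installed, e.g. pip install texthero):

import pandas as pd
import texthero as hero

df = pd.DataFrame({'text': ["this is a sample sentence", "remove stop words from it"]})  # hypothetical data
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])
print(df['text_without_stopwords'])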
A complete example that tokenizes a sentence first:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in word_tokens if w not in stop_words]

# equivalently, with an explicit loop:
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
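Running this prints roughly the following (note that the capitalized 'This' survives, since the NLTK list contains only lowercase words):

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']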

I will show you some examples. First I extract the text data from the data frame (twitter_df) for further processing, as follows:

     from nltk.tokenize import word_tokenize
     tweetText = twitter_df['text']

Then to tokenize it I use the following method:

     from nltk.tokenize import word_tokenize
     tweetText = tweetText.apply(word_tokenize)

Then, to remove the stop words:

     import nltk
     from nltk.corpus import stopwords
     nltk.download('stopwords')

     stop_words = set(stopwords.words('english'))
     tweetText = tweetText.apply(lambda x:[word for word in x if word not in stop_words])
     tweetText.head()
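For completeness, a runnable sketch of this whole pipeline with a toy twitter_df (the 'text' column name is taken from the snippets above):

import nltk
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # stop word lists

twitter_df = pd.DataFrame({'text': ["this is a sample tweet", "another tweet with stop words"]})  # hypothetical data
tweetText = twitter_df['text'].apply(word_tokenize)
stop_words = set(stopwords.words('english'))
tweetText = tweetText.apply(lambda x: [word for word in x if word not in stop_words])
print(tweetText.head())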

I think this will help you:

# 1) building a new list of the kept words
print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]  # named stop_list to avoid shadowing the builtin 'list'
another_list = []
for x in userstring:
    if x not in stop_list:          # compare against the stop list and keep everything else
        another_list.append(x)      # it is also possible to use .remove
for x in another_list:
    print(x, end=' ')

# 2) if you prefer to use .remove instead
print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]
for x in userstring[:]:             # iterate over a copy: removing while iterating skips elements
    if x in stop_list:
        userstring.remove(x)
for x in userstring:
    print(x, end=' ')
