How to remove punctuation and stop words using Python
I have a Python function that is supposed to remove punctuation and stop words, but when I print the result they are not removed.
Where is the bug in my function?
Code:
from collections import Counter

from nltk.corpus import stopwords
from string import punctuation

ppt = '''...!@#$%^&*(){}[]|._-`/?:;"'\,~12345678876543'''

def text_process(raw_text):
    '''
    Parameters:
    ===========
    raw_text: text as input

    Functions:
    ==========
    - remove all punctuation
    - remove all stop words
    - return a list of the cleaned text
    '''
    # check characters to see if they are in punctuation
    nopunc = [char for char in list(raw_text) if char not in ppt]
    # join the characters again to form the string
    nopunc = "".join(nopunc)
    # now just remove any stopwords
    return [word for word in nopunc.lower().split() if word.lower() not in stopwords.words("english")]

def_test_twtr_preds["tokens"] = def_test_twtr_preds["processed_TEXT"].apply(text_process)

# get most common words in dataset
all_words = []
for line in list(def_test_twtr_preds["processed_TEXT"]):
    words = line.split()
    for word in words:
        all_words.append(word.lower())
print("Most common words:\n{}".format(Counter(all_words).most_common(10)))
When I display the most common words in the dataset, the result is:
Most common words:
[('the', 281), ('and', 103), ('words', 81), ('…', 70), ('are', 61), ('word', 57), ('for', 55), ('you', 48), ('this', 40), ('.', 34)]
Note that list('your text') produces ['y', 'o', 'u', 'r', ' ', 't', 'e', 'x', 't'], not ['your', 'text'].
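A quick way to see the difference (plain Python, no NLTK needed):

```python
# list() splits a string into individual characters, including spaces
chars = list("your text")
print(chars)   # ['y', 'o', 'u', 'r', ' ', 't', 'e', 'x', 't']

# str.split() splits on whitespace, giving whole words
words = "your text".split()
print(words)   # ['your', 'text']
```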
You can remove the punctuation with nopunc = [w for w in raw_text.split() if w.isalpha()].
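For example, str.isalpha() is False for any token that contains a non-letter character, so this filter keeps only purely alphabetic words (the sample sentence here is just for illustration):

```python
raw = "I'm fine today"
# keep only tokens made up entirely of letters
nopunc = [w for w in raw.split() if w.isalpha()]
print(nopunc)  # ['fine', 'today'] -- "I'm" is dropped because of the apostrophe
```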
However, the code above will also drop the word I'm in I'm fine. So if you want to get ['I', 'm', 'fine'], you can use the following code instead:
import nltk

tokenizer = nltk.RegexpTokenizer(r"\w+")
nopunc = tokenizer.tokenize(raw_text)
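If NLTK is not installed, the same \w+ pattern can be applied with Python's built-in re module; this is a rough stand-in for the tokenizer above, not NLTK's exact implementation:

```python
import re

def tokenize(raw_text):
    # \w+ matches runs of letters, digits, and underscores,
    # so punctuation acts as a separator and is discarded
    return re.findall(r"\w+", raw_text)

print(tokenize("I'm fine"))  # ['I', 'm', 'fine']
```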