
Which is the efficient way to remove stop words in textblob for sentiment analysis of text?

I am trying to do sentiment analysis of newspaper headlines using the Naive Bayes algorithm. I am using TextBlob for this purpose, and I find it difficult to remove stop words such as 'a', 'the', 'in', etc. Below is a snippet of my Python code:

from textblob.classifiers import NaiveBayesClassifier
from textblob import TextBlob

test = [
("11 bonded labourers saved from shoe firm", "pos"),
("Scientists greet Abdul Kalam after the successful launch of Agni on May 22, 1989","pos"),
("Heavy Winter Snow Storm Lashes Out In Northeast US", "neg"),
("Apparent Strike On Gaza Tunnels Kills 2 Palestinians", "neg")
       ]

with open('input.json', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format="json")

print(cl.classify("Oil ends year with biggest gain since 2009"))  # "pos"
print(cl.classify("25 dead in Baghdad blasts"))  # "neg"

You can load the JSON first, then build the list of (text, label) tuples while using replace to drop the stop words.

Demo:

Suppose the input.json file looks like this:

[
    {"text": "I love this sandwich.", "label": "pos"},
    {"text": "This is an amazing place!", "label": "pos"},
    {"text": "I do not like this restaurant", "label": "neg"}
]

Then you can use:

from textblob.classifiers import NaiveBayesClassifier
import json

train_list = []
with open('input.json', 'r') as fp:
    json_data = json.load(fp)
    for line in json_data:
        text = line['text']
        text = text.replace(" is ", " ") # you can remove multiple stop words
        label = line['label']
        train_list.append((text, label))
    cl = NaiveBayesClassifier(train_list)

from pprint import pprint
pprint(train_list)

Output:

[(u'I love this sandwich.', u'pos'),
 (u'This an amazing place!', u'pos'),
 (u'I do not like this restaurant', u'neg')]
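
Chaining replace calls works for a handful of words, but it quickly gets unwieldy. A minimal sketch of the same idea with a proper stop word filter might look like the following; note that the STOP_WORDS set below is only an illustrative assumption, and you could equally load it from a file or from NLTK's stopwords corpus:

from textblob.classifiers import NaiveBayesClassifier
import json

# Illustrative stop word set (an assumption, not from the original post);
# extend it or load it from a file as needed.
STOP_WORDS = {"a", "an", "the", "in", "is", "of", "on"}

def remove_stop_words(text):
    # Drop whitespace-separated tokens whose lowercase form is a stop word.
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

train_list = []
with open('input.json', 'r') as fp:
    for item in json.load(fp):
        train_list.append((remove_stop_words(item['text']), item['label']))

cl = NaiveBayesClassifier(train_list)

# Apply the same preprocessing to anything you classify.
print(cl.classify(remove_stop_words("Oil ends year with biggest gain since 2009")))

Applying the same remove_stop_words preprocessing at classification time keeps training and prediction consistent.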

Here is code to remove the stop words from the text. Put all the stop words in a stopwords file, then read the words and store them in the stop_words variable.


# This function reads a file and returns its contents as an array
def readFileandReturnAnArray(fileName, readMode, isLower):
    myArray=[]
    with open(fileName, readMode) as readHandle:
        for line in readHandle.readlines():
            lineRead = line
            if isLower:
                lineRead = lineRead.lower()
            myArray.append(lineRead.strip().lstrip())
    return myArray

stop_words = readFileandReturnAnArray("stopwords","r",True)

def removeItemsInTweetContainedInAList(tweet_text,stop_words,splitBy):
    wordsArray = tweet_text.split(splitBy)
    StopWords = list(set(wordsArray).intersection(set(stop_words)))
    return_str=""
    for word in wordsArray:
        if word not in StopWords:
            return_str += word + splitBy
    return return_str.strip().lstrip()


# Call the above method
tweet_text = removeItemsInTweetContainedInAList(tweet_text.strip().lstrip(),stop_words, " ")
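
For example, calling the helper with a small hand-made stop word list (the headline and the list here are made up for illustration) gives:

# Hypothetical example input, not from the original post.
stop_words = ["a", "the", "in"]
tweet_text = "Heavy winter snow storm lashes out in the northeast US"

# Note: the comparison is case-sensitive, so "The" would not be removed.
print(removeItemsInTweetContainedInAList(tweet_text.strip(), stop_words, " "))
# Heavy winter snow storm lashes out northeast US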




 