简体   繁体   中英

Which is the efficient way to remove stop words in textblob for sentiment analysis of text?

I'm trying to implement Naive Bayes algorithm for sentiment analysis of News Paper headlines. I'm using TextBlob for this purpose and I'm finding it difficult to remove stop words such as 'a', 'the', 'in' etc. Below is the snippet of my code in python:

from textblob.classifiers import NaiveBayesClassifier
from textblob import TextBlob

test = [
("11 bonded labourers saved from shoe firm", "pos"),
("Scientists greet Abdul Kalam after the successful launch of Agni on May 22, 1989","pos"),
("Heavy Winter Snow Storm Lashes Out In Northeast US", "neg"),
("Apparent Strike On Gaza Tunnels Kills 2 Palestinians", "neg")
       ]

with open('input.json', 'r') as fp:
cl = NaiveBayesClassifier(fp, format="json")

print(cl.classify("Oil ends year with biggest gain since 2009"))  # "pos"
print(cl.classify("25 dead in Baghdad blasts"))  # "neg"

You can first load the json and then create list of tuples(text, label) with the replacement.

Demonstration:

Suppose the input.json file is something like this:

[
    {"text": "I love this sandwich.", "label": "pos"},
    {"text": "This is an amazing place!", "label": "pos"},
    {"text": "I do not like this restaurant", "label": "neg"}
]

Then you can use:

from textblob.classifiers import NaiveBayesClassifier
import json

train_list = []
with open('input.json', 'r') as fp:
    json_data = json.load(fp)
    for line in json_data:
        text = line['text']
        text = text.replace(" is ", " ") # you can remove multiple stop words
        label = line['label']
        train_list.append((text, label))
    cl = NaiveBayesClassifier(train_list)

from pprint import pprint
pprint(train_list)

output:

[(u'I love this sandwich.', u'pos'),
 (u'This an amazing place!', u'pos'),
 (u'I do not like this restaurant', u'neg')]

Following is the code to remove stopwords in the text. Place all the stopwords in the stopwords files, then read the words and store into stop_words variable.


# This function reads a file and returns its contents as an array
def readFileandReturnAnArray(fileName, readMode, isLower):
    myArray=[]
    with open(fileName, readMode) as readHandle:
        for line in readHandle.readlines():
            lineRead = line
            if isLower:
                lineRead = lineRead.lower()
            myArray.append(lineRead.strip().lstrip())
    readHandle.close()
    return myArray

stop_words = readFileandReturnAnArray("stopwords","r",True)

def removeItemsInTweetContainedInAList(tweet_text,stop_words,splitBy):
    wordsArray = tweet_text.split(splitBy)
    StopWords = list(set(wordsArray).intersection(set(stop_words)))
    return_str=""
    for word in wordsArray:
        if word not in StopWords:
            return_str += word + splitBy
    return return_str.strip().lstrip()


# Call the above method
tweet_text = removeItemsInTweetContainedInAList(tweet_text.strip().lstrip(),stop_words, " ")


The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM