Twitter sentiment analysis on a string

Question

I've written a program that takes a Twitter dataset containing tweets and labels (0 for neutral sentiment, 1 for negative sentiment) and predicts which category a tweet belongs to. The program works well on the training and test sets. However, I'm having trouble applying the predict function to a single string, and I'm not sure how to do that.

I have tried cleaning the string the same way I cleaned the dataset before calling the predict function, but the values returned are in the wrong shape.

import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
import re

#Loading dataset
dataset = pd.read_csv('tweet.csv')

#List to hold cleaned tweets
clean_tweet = []

#Cleaning tweets
for i in range(len(dataset)):
    tweet = re.sub('[^a-zA-Z]', ' ', dataset['tweet'][i])
    tweet = re.sub('@[\w]*',' ',dataset['tweet'][i])
    tweet = tweet.lower()
    tweet = tweet.split()
    tweet = [ps.stem(token) for token in tweet if not token in set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    clean_tweet.append(tweet)

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 3000)
X = cv.fit_transform(clean_tweet)
X =  X.toarray()
y = dataset.iloc[:, 1].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

from sklearn.naive_bayes import GaussianNB
n_b = GaussianNB()
n_b.fit(X_train, y_train)
y_pred  = n_b.predict(X_test) 

some_tweet = "this is a mean tweet"  # How to apply predict function to this string

Answer 1

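Use cv.transform([cleaned_new_tweet]) on your new string to map the new tweet onto the vocabulary of your existing document-term matrix. That returns the tweet in the correct shape (one row, one column per feature). Since you trained GaussianNB on X.toarray(), call .toarray() on the result as well before passing it to n_b.predict().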
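
A minimal sketch of that suggestion, assuming the tweets are already cleaned. The toy tweets and labels are made up for illustration and stand in for your dataset; cv and n_b play the same roles as in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

# Hypothetical, already-cleaned training tweets (stand-in for the real dataset)
train_tweets = ["love sunni day", "hate mean peopl", "nice friendli tweet", "mean nasti tweet"]
train_labels = [0, 1, 0, 1]

# Fit the vectorizer once on the training tweets
cv = CountVectorizer(max_features=3000)
X_train = cv.fit_transform(train_tweets).toarray()

n_b = GaussianNB()
n_b.fit(X_train, train_labels)

# New tweet: wrap it in a list and use transform (not fit_transform),
# so it is mapped onto the vocabulary learned above
cleaned_new_tweet = "mean tweet"
X_new = cv.transform([cleaned_new_tweet]).toarray()

print(X_new.shape)          # one row, same number of columns as X_train
prediction = n_b.predict(X_new)
print(prediction)
```

The important part is that the same fitted cv is reused; fitting a fresh CountVectorizer on the new string would produce a different number of columns.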

Answer 2

tl;dr

The .predict() of the pipeline below expects a list of strings, so you need to put some_tweet inside a list, e.g. new_tweet = ["this is a mean tweet"].
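
To see why the list matters, here is a small sketch with made-up, already-cleaned tweets. It swaps in MultinomialNB (an assumption for brevity, since it accepts the sparse output of CountVectorizer directly); the string-versus-list behaviour comes from CountVectorizer and does not depend on the classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical, already-cleaned training tweets
train_tweets = ["love sunni day", "hate mean peopl", "nice friendli tweet", "mean nasti tweet"]
train_labels = [0, 1, 0, 1]

classifier = Pipeline([("cv", CountVectorizer()), ("n_b", MultinomialNB())])
classifier.fit(train_tweets, train_labels)

# A bare string is rejected by CountVectorizer with a ValueError
try:
    classifier.predict("mean tweet")
    bare_string_rejected = False
except ValueError:
    bare_string_rejected = True

# A list containing one string is treated as one document
predicted = classifier.predict(["mean tweet"])
print(bare_string_rejected, predicted.shape)
```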

Your code

You had some issues in your code that I tried fixing for you...

import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
import re

#Loading dataset
dataset = pd.read_csv('tweet.csv')


# Define cleaning function
# You can define it once as a function so it can be easily re-used elsewhere
stop_words = set(stopwords.words('english'))  # build the stop-word set once, not once per token

def clean_tweet(tweet: str) -> str:
    # BUG in your version: the second re.sub read dataset['tweet'][i] again and threw away the first result.
    # Always work on the `tweet` argument, and strip @mentions first while the '@' is still there.
    tweet = re.sub(r'@\w*', ' ', tweet)
    tweet = re.sub('[^a-zA-Z]', ' ', tweet)
    tweet = tweet.lower()
    tweet = tweet.split()
    tweet = [ps.stem(token) for token in tweet if token not in stop_words]
    tweet = ' '.join(tweet)
    return tweet

#List to hold cleaned tweets and labels
X = [clean_tweet(tweet) for tweet in dataset['tweet']] # you can create your X directly with your new function
y = dataset.iloc[:, 1].values

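# Define a single model
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Use a Pipeline as your classifier; this way you don't need to call transform and fit separately every time.
# CountVectorizer returns a sparse matrix but GaussianNB needs a dense array,
# so a small extra step converts it (the equivalent of X.toarray() in your code).
classifier = Pipeline(
    [
        ('cv', CountVectorizer(max_features=300)),
        ('to_dense', FunctionTransformer(lambda X: X.toarray(), accept_sparse=True)),
        ('n_b', GaussianNB())
    ]
)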


# In your version you fit the CountVectorizer BEFORE splitting into train/test. That is a big mistake,
# because the vocabulary is then learned from the test tweets as well.
# First split into train/test, and then fit all the steps of your model on the training part only.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Here you train all steps of your Pipeline in one go.
classifier.fit(X_train, y_train)
y_pred  = classifier.predict(X_test)


# Predicting new tweets
some_tweet = "this is a mean tweet"
some_tweet = clean_tweet(some_tweet) # re-use your clean function
predicted = classifier.predict([some_tweet]) # put the tweet inside a list!!!!
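
As a small illustration of the split-first point, the sketch below (toy, made-up tweets; the variable names are only for this example) fits a CountVectorizer on the training split alone and then maps the test split onto that vocabulary, which is what the Pipeline does internally when you call fit and predict.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical, already-cleaned tweets and labels
tweets = [
    "love sunni day", "hate mean peopl", "nice friendli tweet", "mean nasti tweet",
    "great happi morn", "rude angri repli", "kind warm word", "nasti hate comment",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# Split first ...
X_train, X_test, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.25, random_state=0
)

# ... then learn the vocabulary from the training tweets only
cv = CountVectorizer()
X_train_counts = cv.fit_transform(X_train)

# Test tweets are only transformed; words never seen in training are ignored
X_test_counts = cv.transform(X_test)

print(X_train_counts.shape, X_test_counts.shape)
```

Both matrices end up with the same number of columns, so the classifier trained on the first can score the second.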
