
Twitter sentiment analysis on a string

I've written a program that takes Twitter data containing tweets and labels (0 for neutral sentiment and 1 for negative sentiment) and predicts which category a tweet belongs to. The program works well on the training and test sets. However, I'm having a problem applying the predict function to a single string. I'm not sure how to do that.

I have tried cleaning the string the same way I cleaned the dataset before calling the predict function, but the values returned are in the wrong shape.

import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
import re

#Loading dataset
dataset = pd.read_csv('tweet.csv')

#List to hold cleaned tweets
clean_tweet = []

#Cleaning tweets
for i in range(len(dataset)):
    tweet = re.sub('[^a-zA-Z]', ' ', dataset['tweet'][i])
    tweet = re.sub('@[\w]*',' ',dataset['tweet'][i])
    tweet = tweet.lower()
    tweet = tweet.split()
    tweet = [ps.stem(token) for token in tweet if not token in set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    clean_tweet.append(tweet)

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 3000)
X = cv.fit_transform(clean_tweet)
X =  X.toarray()
y = dataset.iloc[:, 1].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

from sklearn.naive_bayes import GaussianNB
n_b = GaussianNB()
n_b.fit(X_train, y_train)
y_pred  = n_b.predict(X_test) 

some_tweet = "this is a mean tweet"  # How to apply predict function to this string

Use cv.transform([cleaned_new_tweet]) on your new string to map the new tweet onto the columns of your existing document-term matrix. That will return the tweet in the correct shape.
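As a minimal sketch of that suggestion, the snippet below fits a vectorizer and a GaussianNB on a tiny made-up corpus (the tweets and labels are hypothetical stand-ins for the cleaned dataset) and then transforms one new string. It assumes the same dense-array setup as the question, so the transformed row is also converted with .toarray() before calling predict.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

# Hypothetical, already-cleaned tweets standing in for clean_tweet from the question
clean_tweets = ["love sunni day", "nice friendli peopl", "hate mean peopl", "mean ugli tweet"]
labels = [0, 0, 1, 1]  # 0 = neutral, 1 = negative, as in the question

cv = CountVectorizer(max_features=3000)
X = cv.fit_transform(clean_tweets).toarray()  # dense matrix, one row per tweet

n_b = GaussianNB()
n_b.fit(X, labels)

cleaned_new_tweet = "mean tweet"
# transform (not fit_transform) reuses the vocabulary learned above;
# wrapping the string in a list yields one row with the training columns
new_X = cv.transform([cleaned_new_tweet]).toarray()
prediction = n_b.predict(new_X)

print(X.shape, new_X.shape, prediction.shape)
```

The new row has the same number of columns as the training matrix, which is what the classifier checks when predict is called.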

tl;dr

.predict() expects a list of strings, so you need to put some_tweet inside a list, e.g. new_tweet = ["this is a mean tweet"].
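A short sketch of why the list matters, using a made-up corpus. MultinomialNB is swapped in here purely to keep the example small, since it accepts the sparse output of CountVectorizer directly; the data and variable names are assumptions, not the asker's.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training data
tweets = ["love sunny day", "nice friendly people", "hate mean people", "mean ugly tweet"]
labels = [0, 0, 1, 1]

classifier = Pipeline([("cv", CountVectorizer()), ("n_b", MultinomialNB())])
classifier.fit(tweets, labels)

some_tweet = "this is a mean tweet"

# A bare string is rejected by CountVectorizer, which wants an iterable of documents
try:
    classifier.predict(some_tweet)
    bare_string_ok = True
except ValueError:
    bare_string_ok = False

# A one-element list is treated as one document and gives one prediction
predicted = classifier.predict([some_tweet])

print(bare_string_ok, predicted.shape)
```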

Your code

You had some issues in your code that I tried to fix for you...

import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
import re

#Loading dataset
dataset = pd.read_csv('tweet.csv')


# Define the cleaning function
# Defining it once as a function makes it easy to reuse elsewhere
STOP_WORDS = set(stopwords.words('english'))  # build the set once instead of once per token

def clean_tweet(tweet: str) -> str:
    # BUG in the question: the second re.sub was applied to the original tweet again,
    # which threw away the result of the first one. Each step must work on `tweet`.
    tweet = re.sub(r'@\w*', ' ', tweet)       # strip @mentions first, while '@' is still there
    tweet = re.sub('[^a-zA-Z]', ' ', tweet)   # then replace every non-letter with a space
    tweet = tweet.lower()
    tokens = tweet.split()
    tokens = [ps.stem(token) for token in tokens if token not in STOP_WORDS]
    return ' '.join(tokens)

#List to hold cleaned tweets and labels
X = [clean_tweet(tweet) for tweet in dataset['tweet']] # you can create your X directly with your new function
y = dataset.iloc[:, 1].values

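# Define a single model
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Use a Pipeline as your classifier, this way you don't need to keep calling transform and fit separately.
# CountVectorizer returns a sparse matrix but GaussianNB needs a dense array,
# so a small densifying step sits between the two.
classifier = Pipeline(
    [
        ('cv', CountVectorizer(max_features=300)),
        ('to_dense', FunctionTransformer(lambda m: m.toarray(), accept_sparse=True)),
        ('n_b', GaussianNB())
    ]
)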


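# In the question, the CountVectorizer was fitted BEFORE splitting into train/test. That is a big mistake,
# because the vocabulary then includes words that only occur in the test set.
# Split into train/test first, and then fit all the steps of your model on the training set only.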

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Here you train all steps of your Pipeline in one go.
classifier.fit(X_train, y_train)
y_pred  = classifier.predict(X_test)


# Predicting new tweets
some_tweet = "this is a mean tweet"
some_tweet = clean_tweet(some_tweet) # re-use your clean function
predicted = classifier.predict([some_tweet]) # put the tweet inside a list!!!! 
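
The point about fitting the vectorizer only after the split can be seen on a toy example. This is an illustrative sketch with made-up tweets: a vectorizer fitted on everything picks up terms that occur only in the held-out data, while one fitted on the training part alone does not.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned tweets, already split
train_tweets = ["hate mean peopl", "love sunni day"]
test_tweets = ["mean ugli tweet"]

# Fitted on train + test: the vocabulary has seen the test set
cv_leaky = CountVectorizer().fit(train_tweets + test_tweets)
# Fitted on train only: unseen test words are simply ignored at transform time
cv_clean = CountVectorizer().fit(train_tweets)

leaked_terms = set(cv_leaky.vocabulary_) - set(cv_clean.vocabulary_)
test_matrix = cv_clean.transform(test_tweets)

print(sorted(leaked_terms), test_matrix.shape)
```

Fitting a Pipeline on X_train does the second variant automatically, which is one reason to bundle the vectorizer and the classifier together.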
