I'm fairly new to programming and NLP in general. I found some code on this website (https://towardsdatascience.com/creating-the-twitter-sentiment-analysis-program-in-python-with-naive-bayes-classification-672e5589a7ed) to use for sentiment analysis on Twitter. I already have the CSV files I need, so instead of building them I just defined the variables from the files.
When I try to run the code, it gives me a TypeError on this line:
preprocessedTrainingSet = tweetProcessor.processTweets(trainingData)
which traces back to the line:
processedTweets.append((self._processTweet(tweet["text"]),tweet["label"]))
I don't know how to fix the issue while keeping the core functionality of the code intact.
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from string import punctuation
from nltk.corpus import stopwords
import twitter
import csv
import time
import nltk
nltk.download('stopwords')

testDataSet = pd.read_csv("Twitter data.csv")
print(testDataSet[0:4])
trainingData = pd.read_csv("full-corpus.csv")
print(trainingData[0:4])

class PreProcessTweets:
    def __init__(self):
        self._stopwords = set(stopwords.words('english') + list(punctuation) + ['AT_USER', 'URL'])

    def processTweets(self, list_of_tweets):
        processedTweets = []
        for tweet in list_of_tweets:
            processedTweets.append((self._processTweet(tweet["text"]), tweet["label"]))
        return processedTweets

    def _processTweet(self, tweet):
        tweet = tweet.lower()  # convert text to lower-case
        tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # replace URLs
        tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)  # replace usernames
        tweet = re.sub(r'#([^\s]+)', r'\1', tweet)  # remove the # in #hashtag
        tweet = word_tokenize(tweet)  # split the tweet into tokens
        return [word for word in tweet if word not in self._stopwords]

tweetProcessor = PreProcessTweets()
preprocessedTrainingSet = tweetProcessor.processTweets(trainingData)
preprocessedTestSet = tweetProcessor.processTweets(testDataSet)
I expect it to start cleaning the data I've collected before I move on to Naive Bayes classification.
It's hard to tell without your actual data, but I think you are mixing up types: iterating directly over a pandas DataFrame yields its column names (plain strings), not its rows, so `tweet["text"]` ends up indexing a string with a string and raises a TypeError.
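You can see the problem in isolation with a tiny stand-in DataFrame (the rows here are made up, just to illustrate the error):

```python
import pandas as pd

# Tiny stand-in for trainingData (made-up rows, just to illustrate)
trainingData = pd.DataFrame({"text": ["hello @you", "nice #day"],
                             "label": ["positive", "neutral"]})

# Iterating over a DataFrame yields its COLUMN NAMES, not its rows:
for tweet in trainingData:
    print(tweet)  # prints "text", then "label" -- each is a plain string

# So inside processTweets, tweet["text"] is really "text"["text"],
# and indexing a string with a string raises
# TypeError: string indices must be integers
```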
I downloaded some tweets from this site. With that data I tested your code and made the following adjustments.
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from string import punctuation
from nltk.corpus import stopwords
import nltk

# I also had to install 'punkt'
nltk.download('punkt')
nltk.download('stopwords')

testDataSet = pd.read_csv("data.csv")
# For testing whether the code works, I only used a test set, no training data.

class PreProcessTweets:
    def __init__(self):
        self._stopwords = set(stopwords.words('english') + list(punctuation) + ['AT_USER', 'URL'])

    # To make it clear, I renamed the parameter to df_of_tweets (df = DataFrame)
    def processTweets(self, df_of_tweets):
        processedTweets = []
        # turn the DataFrame columns into lists;
        # my data has no 'label' column, so I used 'sentiment' instead
        list_of_tweets = df_of_tweets.text.tolist()
        list_of_sentiment = df_of_tweets.sentiment.tolist()
        # enumerate keeps track of each tweet's index so it can be used
        # to look up the matching sentiment
        for index, tweet in enumerate(list_of_tweets):
            # adjusted so the values are taken from the lists directly
            processedTweets.append((self._processTweet(tweet), list_of_sentiment[index]))
        return processedTweets

    def _processTweet(self, tweet):
        tweet = tweet.lower()  # convert text to lower-case
        tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # replace URLs
        tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)  # replace usernames
        tweet = re.sub(r'#([^\s]+)', r'\1', tweet)  # remove the # in #hashtag
        tweet = word_tokenize(tweet)  # split the tweet into tokens
        return [word for word in tweet if word not in self._stopwords]

tweetProcessor = PreProcessTweets()
preprocessedTestSet = tweetProcessor.processTweets(testDataSet)
print(preprocessedTestSet)
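As a side note, the two `tolist()` calls plus `enumerate` can be collapsed by zipping the columns directly. A minimal sketch, using the same made-up `text`/`sentiment` column names as above:

```python
import pandas as pd

# Made-up rows mirroring the hypothetical data.csv layout
df_of_tweets = pd.DataFrame({"text": ["hello @you", "nice #day"],
                             "sentiment": ["positive", "neutral"]})

# zip pairs each tweet with its sentiment; no manual indexing needed
pairs = [(text, sentiment)
         for text, sentiment in zip(df_of_tweets["text"], df_of_tweets["sentiment"])]
```

Each `(tweet, sentiment)` pair can then be fed to `_processTweet` exactly as in the loop above.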
Hope it helps!