
TypeError: string indices must be integers (Text Data Preprocessing in CSV files for Sentiment Analysis)

I'm kind of new to programming and NLP in general. I found some code on this website ( https://towardsdatascience.com/creating-the-twitter-sentiment-analysis-program-in-python-with-naive-bayes-classification-672e5589a7ed ) to use for sentiment analysis on Twitter. I already have the CSV files I need, so instead of building them I just defined the variables from the files.

When I try to run the code, it gives me a TypeError on this line:

preprocessedTrainingSet = tweetProcessor.processTweets(trainingData)

and the traceback points to this line:

processedTweets.append((self._processTweet(tweet["text"]), tweet["label"]))

I don't know how to work around the issue while keeping the core functionality of the code intact.

import pandas as pd
import re
from nltk.tokenize import word_tokenize
from string import punctuation 
from nltk.corpus import stopwords 
import twitter
import csv
import time
import nltk
nltk.download('stopwords')

testDataSet = pd.read_csv("Twitter data.csv")
print(testDataSet[0:4])
trainingData = pd.read_csv("full-corpus.csv")
print(trainingData[0:4])


class PreProcessTweets:
    def __init__(self):
        self._stopwords = set(stopwords.words('english') + list(punctuation) + ['AT_USER','URL'])

    def processTweets(self, list_of_tweets):
        processedTweets=[]
        for tweet in list_of_tweets:
            processedTweets.append((self._processTweet(tweet["text"]),tweet["label"]))
        return processedTweets

    def _processTweet(self, tweet):
        tweet = tweet.lower() # convert text to lower-case
        tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet) # replace URLs with a placeholder
        tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet) # replace usernames with a placeholder
        tweet = re.sub(r'#([^\s]+)', r'\1', tweet) # remove the # in #hashtag
        tweet = word_tokenize(tweet) # split the tweet into word tokens
        return [word for word in tweet if word not in self._stopwords]

tweetProcessor = PreProcessTweets()
preprocessedTrainingSet = tweetProcessor.processTweets(trainingData)
preprocessedTestSet = tweetProcessor.processTweets(testDataSet)

I expect it to start cleaning the data I've found before I can start using Naive Bayes.

It's hard to tell without your actual data, but I think you are mixing up several types:

  1. When loading the CSV data, you create a pandas DataFrame.
  2. Then, in the processTweets method, you try to loop over this DataFrame as if it were a list.
  3. Finally, in the for loop of processTweets, you access each element, which you call 'tweet', with the keys 'text' and 'label'. I doubt, however, that there is a dictionary in there: iterating a DataFrame yields its column names, which are plain strings (see the sketch below).
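
Here is a minimal sketch of what goes wrong, assuming a CSV with "text" and "label" columns like in your code:

import pandas as pd

df = pd.DataFrame({"text": ["hello world"], "label": ["positive"]})

# Iterating a DataFrame directly yields its column names, which are strings...
for tweet in df:
    print(tweet)  # prints "text", then "label"
    # ...so tweet["text"] indexes a string with a string, raising
    # TypeError: string indices must be integers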

I downloaded some tweets from this site. With this data, I tested your code and made the following adjustments.

import pandas as pd
import re
from nltk.tokenize import word_tokenize
from string import punctuation
from nltk.corpus import stopwords
import nltk

#had to install 'punkt'
nltk.download('punkt')
nltk.download('stopwords')
testDataSet = pd.read_csv("data.csv")

# For testing whether the code works I only used the testDataSet, and no trainingData.


class PreProcessTweets:
    def __init__(self):
        self._stopwords = set(stopwords.words('english') + list(punctuation) + ['AT_USER','URL'])

    # To make it clear I changed the parameter to df_of_tweets (df = dataframe)
    def processTweets(self, df_of_tweets):

        processedTweets=[]

        #turning the dataframe into lists
        # in my data I did not have a label, so I used sentiment instead.
        list_of_tweets = df_of_tweets.text.tolist()
        list_of_sentiment = df_of_tweets.sentiment.tolist()

        # using enumerate to keep track of the index of the tweets so I can use it to index the list of sentiment
        for index, tweet in enumerate(list_of_tweets):
            # adjusted the code here so that it takes values of the lists straight away.
            processedTweets.append((self._processTweet(tweet), list_of_sentiment[index]))
        return processedTweets

    def _processTweet(self, tweet):
        tweet = tweet.lower() # convert text to lower-case
        tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet) # replace URLs with a placeholder
        tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet) # replace usernames with a placeholder
        tweet = re.sub(r'#([^\s]+)', r'\1', tweet) # remove the # in #hashtag
        tweet = word_tokenize(tweet) # split the tweet into word tokens
        return [word for word in tweet if word not in self._stopwords]


tweetProcessor = PreProcessTweets()
preprocessedTestSet = tweetProcessor.processTweets(testDataSet)
print(preprocessedTestSet)
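
Alternatively, you could skip the list conversion and iterate the rows directly with pandas' itertuples(). This is just a variant sketch: process_dataframe is a hypothetical helper, it reuses the PreProcessTweets class above, and it assumes the same "text" and "sentiment" columns as my test data.

# Hypothetical helper: iterate DataFrame rows directly instead of
# converting each column to a list first. Assumes "text" and
# "sentiment" columns and reuses PreProcessTweets from above.
def process_dataframe(df_of_tweets, processor):
    return [
        (processor._processTweet(row.text), row.sentiment)
        for row in df_of_tweets.itertuples()
    ]

preprocessedTestSet = process_dataframe(testDataSet, PreProcessTweets())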

Hope it helps!
