[英]TypeError: string indices must be integers (Text Data Preprocessing in CSV files for Sentiment Analysis)
I'm kind of new to programming and NLP in general. 我对编程和NLP有点新意。 I've found some code on this website :( https://towardsdatascience.com/creating-the-twitter-sentiment-analysis-program-in-python-with-naive-bayes-classification-672e5589a7ed ) to use for sentiment analysis on twitter.
我在这个网站上找到了一些代码:( https://towardsdatascience.com/creating-the-twitter-sentiment-analysis-program-in-python-with-naive-bayes-classification-672e5589a7ed )用于情绪分析在推特上。 I have the csv files i need and so instead of building them i just defined the variables by the files.
我有我需要的csv文件,因此我只是通过文件定义变量而不是构建它们。
When i try to run the code it's giving me a type error when running this line: 当我尝试运行代码时,它在运行此行时给出了类型错误:
preprocessedTrainingSet = tweetProcessor.processTweets(trainingData) preprocessedTrainingSet = tweetProcessor.processTweets(trainingData)
And traces back to the line: 并追溯到这条线:
processedTweets.append((self._processTweet(tweet["text"]),tweet["label"])). processedTweets.append((self._processTweet(鸣叫[ “文本”]),鸣叫[ “标签”]))。
I don't know how to circumvent the issue and still keep core functionality of the code intact. 我不知道如何规避问题,仍然保持代码的核心功能完好无损。
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from string import punctuation
from nltk.corpus import stopwords
import twitter
import csv
import time
import nltk
nltk.download('stopwords')
testDataSet = pd.read_csv("Twitter data.csv")
print(testDataSet[0:4])
trainingData = pd.read_csv("full-corpus.csv")
print(trainingData[0:4])
class PreProcessTweets:
def __init__(self):
self._stopwords = set(stopwords.words('english') + list(punctuation) + ['AT_USER','URL'])
def processTweets(self, list_of_tweets):
processedTweets=[]
for tweet in list_of_tweets:
processedTweets.append((self._processTweet(tweet["text"]),tweet["label"]))
return processedTweets
def _processTweet(self, tweet):
tweet = tweet.lower() # convert text to lower-case
tweet = re.sub('((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet) # remove URLs
tweet = re.sub('@[^\s]+', 'AT_USER', tweet) # remove usernames
tweet = re.sub(r'#([^\s]+)', r'\1', tweet) # remove the # in #hashtag
tweet = word_tokenize(tweet) # remove repeated characters (helloooooooo into hello)
return [word for word in tweet if word not in self._stopwords]
tweetProcessor = PreProcessTweets()
preprocessedTrainingSet = tweetProcessor.processTweets(trainingData)
preprocessedTestSet = tweetProcessor.processTweets(testDataSet)
I expect it to start cleaning the data I've found before I can start using Naive Bayes 我希望它能开始清理我在开始使用朴素贝叶斯之前找到的数据
It's hard to tell without your actual data, but I think you are confusing multiple types through each other. 没有你的实际数据很难说,但我认为你们互相混淆了多种类型。
I downloaded some tweets from this site . 我从这个网站下载了一些推文。 With this data, I tested your code and made the following adjustments.
通过这些数据,我测试了您的代码并进行了以下调整。
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from string import punctuation
from nltk.corpus import stopwords
import nltk
#had to install 'punkt'
nltk.download('punkt')
nltk.download('stopwords')
testDataSet = pd.read_csv("data.csv")
# For testing if the code works I only used a TestDatasSet, and no trainingData.
class PreProcessTweets:
def __init__(self):
self._stopwords = set(stopwords.words('english') + list(punctuation) + ['AT_USER','URL'])
# To make it clear I changed the parameter to df_of_tweets (df = dataframe)
def processTweets(self, df_of_tweets):
processedTweets=[]
#turning the dataframe into lists
# in my data I did not have a label, so I used sentiment instead.
list_of_tweets = df_of_tweets.text.tolist()
list_of_sentiment = df_of_tweets.sentiment.tolist()
# using enumerate to keep track of the index of the tweets so I can use it to index the list of sentiment
for index, tweet in enumerate(list_of_tweets):
# adjusted the code here so that it takes values of the lists straight away.
processedTweets.append((self._processTweet(tweet), list_of_sentiment[index]))
return processedTweets
def _processTweet(self, tweet):
tweet = tweet.lower() # convert text to lower-case
tweet = re.sub('((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet) # remove URLs
tweet = re.sub('@[^\s]+', 'AT_USER', tweet) # remove usernames
tweet = re.sub(r'#([^\s]+)', r'\1', tweet) # remove the # in #hashtag
tweet = word_tokenize(tweet) # remove repeated characters (helloooooooo into hello)
return [word for word in tweet if word not in self._stopwords]
tweetProcessor = PreProcessTweets()
preprocessedTestSet = tweetProcessor.processTweets(testDataSet)
tweetProcessor = PreProcessTweets()
print(preprocessedTestSet)
Hope it helps! 希望能帮助到你!
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.