
TypeError: string indices must be integers (Text Data Preprocessing in CSV files for Sentiment Analysis)

I'm fairly new to programming and NLP. I found some code on this site ( https://towardsdatascience.com/creating-the-twitter-sentiment-analysis-program-in-python-with-naive-bayes-classification-672e5589a7ed ) for sentiment analysis on Twitter. I already have the CSV files I need, so instead of building them I simply define the variables from the files.

When I try to run the code, it raises a TypeError on this line:

preprocessedTrainingSet = tweetProcessor.processTweets(trainingData)

and the traceback points to this line:

processedTweets.append((self._processTweet(tweet["text"]), tweet["label"]))

I don't know how to get around the problem while keeping the core functionality of the code intact.

import pandas as pd
import re
from nltk.tokenize import word_tokenize
from string import punctuation 
from nltk.corpus import stopwords 
import twitter
import csv
import time
import nltk
nltk.download('stopwords')

testDataSet = pd.read_csv("Twitter data.csv")
print(testDataSet[0:4])
trainingData = pd.read_csv("full-corpus.csv")
print(trainingData[0:4])


class PreProcessTweets:
    def __init__(self):
        self._stopwords = set(stopwords.words('english') + list(punctuation) + ['AT_USER','URL'])

    def processTweets(self, list_of_tweets):
        processedTweets=[]
        for tweet in list_of_tweets:
            processedTweets.append((self._processTweet(tweet["text"]),tweet["label"]))
        return processedTweets

    def _processTweet(self, tweet):
        tweet = tweet.lower() # convert text to lower-case
        tweet = re.sub('((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet) # remove URLs
        tweet = re.sub('@[^\s]+', 'AT_USER', tweet) # remove usernames
        tweet = re.sub(r'#([^\s]+)', r'\1', tweet) # remove the # in #hashtag
        tweet = word_tokenize(tweet) # remove repeated characters (helloooooooo into hello)
        return [word for word in tweet if word not in self._stopwords]

tweetProcessor = PreProcessTweets()
preprocessedTrainingSet = tweetProcessor.processTweets(trainingData)
preprocessedTestSet = tweetProcessor.processTweets(testDataSet)

I expect it to start cleaning the data I found before I move on to using Naive Bayes.

It's hard to say without your actual data, but I think you are mixing up several types:

  1. When you load the CSV data, you create a pandas DataFrame.
  2. Then, in the processTweets method, you try to iterate over this DataFrame as if it were a list.
  3. Finally, in the for loop of processTweets where you access the values of that "list", which you call "tweet", you try to access the values of "tweet" with the keys "text" and "label". But I doubt there is a dictionary in there (see the sketch below).
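
Here is a minimal sketch of points 2 and 3 (with made-up data, assuming the column names text and label from your code). Iterating over a DataFrame yields its column names, which are plain strings, so tweet["text"] ends up indexing a string with a string:

import pandas as pd

# Tiny made-up frame with the column names assumed from the question's code.
df = pd.DataFrame({"text": ["hello world"], "label": ["positive"]})

# Iterating a DataFrame yields its column NAMES, not its rows.
for tweet in df:
    print(tweet)               # prints "text", then "label"

# Inside the loop, `tweet` is therefore the string "text" (or "label"),
# and indexing a string with a string raises the error from the question:
try:
    "text"["text"]
except TypeError as e:
    print(e)                   # e.g. "string indices must be integers"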

I downloaded some tweets from this site. With that data I tested your code and made the adjustments below.

import pandas as pd
import re
from nltk.tokenize import word_tokenize
from string import punctuation
from nltk.corpus import stopwords
import nltk

#had to install 'punkt'
nltk.download('punkt')
nltk.download('stopwords')
testDataSet = pd.read_csv("data.csv")

# For testing whether the code works I only used testDataSet, and no trainingData.


class PreProcessTweets:
    def __init__(self):
        self._stopwords = set(stopwords.words('english') + list(punctuation) + ['AT_USER','URL'])

    # To make it clear I changed the parameter to df_of_tweets (df = dataframe)
    def processTweets(self, df_of_tweets):

        processedTweets=[]

        #turning the dataframe into lists
        # in my data I did not have a label, so I used sentiment instead.
        list_of_tweets = df_of_tweets.text.tolist()
        list_of_sentiment = df_of_tweets.sentiment.tolist()

        # using enumerate to keep track of the index of the tweets so I can use it to index the list of sentiment
        for index, tweet in enumerate(list_of_tweets):
            # adjusted the code here so that it takes values of the lists straight away.
            processedTweets.append((self._processTweet(tweet), list_of_sentiment[index]))
        return processedTweets

    def _processTweet(self, tweet):
        tweet = tweet.lower() # convert text to lower-case
        tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet) # replace URLs with the placeholder URL
        tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet) # replace usernames with the placeholder AT_USER
        tweet = re.sub(r'#([^\s]+)', r'\1', tweet) # remove the # in #hashtag
        tweet = word_tokenize(tweet) # split the tweet into a list of word tokens
        return [word for word in tweet if word not in self._stopwords]


tweetProcessor = PreProcessTweets()
preprocessedTestSet = tweetProcessor.processTweets(testDataSet)
print(preprocessedTestSet)
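
For reference, an equivalent variant (just a sketch, assuming the same text / sentiment column names and the PreProcessTweets class above) iterates the rows directly with itertuples() instead of converting the columns to lists first:

def processTweetsRowwise(df_of_tweets, processor):
    # Row-wise variant: itertuples() yields one namedtuple per row,
    # so each tweet's text stays paired with its sentiment automatically.
    processedTweets = []
    for row in df_of_tweets.itertuples(index=False):
        processedTweets.append((processor._processTweet(row.text), row.sentiment))
    return processedTweets

# usage, e.g.:
# preprocessedTestSet = processTweetsRowwise(testDataSet, PreProcessTweets())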

Hope that helps!
