
Writing Vader sentiment analysis to new column in csv

I have a scraped csv file of TripAdvisor reviews. There are five columns:

person, title, rating, review, review_date.

I want this code to do the following:

  1. In the csv, create a new column called "tarate".
  2. Populate "tarate" with 'pos', 'neg', or 'neut' based on the numeric values in "rating": 'pos' if "rating" >= 40, 'neut' if "rating" == 30, 'neg' if "rating" < 30.
  3. Next, run the "review" column through SentimentIntensityAnalyzer.
  4. Record the output in a new csv column called "scores".
  5. Create a separate csv column for the "compound" values, using a 'pos' and 'neg' classification.
  6. Run the sklearn.metrics tools to compare the TripAdvisor ratings ("tarate") to "compound". This can just print.

Part of the code is based on [http://akashsenta.com/blog/sentiment-analysis-with-vader-with-python/]

Here is my csv file: [https://github.com/nsusmann/vadersentiment]

I am getting some errors. I am a beginner, and I think I am getting tripped up on things like pointing to specific columns and the lambda function.

Here is the code:

# open command prompt
# import nltk
# nltk.download()
# pip3 install pandas
# pip3 install scikit-learn
# pip3 install matplotlib
# pip3 install seaborn
# pip3 install vaderSentiment
# pip3 install openpyxl

import pandas
import nltk
nltk.download([
    "vader_lexicon",
    "stopwords"])
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.sentiment.util import *
from nltk import tokenize
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
from collections import Counter
import re
import math
import html
import sklearn
import sklearn.metrics as metrics
from sklearn.metrics import mutual_info_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
import openpyxl

# open the file to save the review
import csv
outputfile = open('D:\Documents\Archaeology\Projects\Patmos\TextAnalysis\Sentiment\scraped_cln_sent.csv', 'w', newline='')
df = csv.writer(outputfile)

#open Vader Sentiment Analyzer 
from nltk.sentiment.vader import SentimentIntensityAnalyzer

#make SIA into an object
analyzer = SentimentIntensityAnalyzer()

#create a new column called "tarate"
df['tarate'],
#populate column "tarate". write pos, neut, or neg per row based on column "rating" value
df.loc[df['rating'] >= 40, ['tarate'] == 'Pos',
df.loc[df['rating'] == 30, ['tarate'] == 'Neut',
df.loc[df['rating'] <= 20, ['tarate'] == 'Neg', 

#use polarity_scores() to get sentiment metrics and write them to new column "scores"
df.head['scores'] == df['review'].apply(lambda review: sid.polarity_scores['review'])

#extract the compound value of the polarity scores and write to new column "compound"
df['compound'] = df['scores'].apply(lambda d:d['compound'])

#using column "compound", determine whether the score is <0> and write new column "score" recording positive or negative
df['score'] = df['compound'].apply(lambda score: 'pos' if score >=0 else 'neg')
ta.file()
                                           
#get accuracy metrics. this will compare the trip advisor rating (text version recorded in column "tarate") to the sentiment analysis results in column "score"
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix                                           
accuracy_score(df['tarate'],df['score'])

print(classification_report(df['tarate'],df['score']))

You don't need to create the new column before filling it. Also, you have spurious commas at the ends of lines; don't do that, because a comma at the end of an expression in Python turns it into a tuple. Also remember that = is the assignment operator, while == is a comparison.
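A quick illustration of both points (a minimal sketch, nothing project-specific):

```python
# A trailing comma turns an expression into a one-element tuple
x = 5
y = 5,                   # same as (5,)
print(type(x).__name__)  # int
print(type(y).__name__)  # tuple

# = assigns; == only compares and returns a bool
z = 3              # assignment: z is now 3
result = (z == 4)  # comparison: evaluates to False, z is unchanged
print(result, z)   # False 3
```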

The pandas "loc" function takes a row indexer, and a column indexer:

#populate column "tarate". write pos, neut, or neg per row based on column "rating" value
df.loc[df['rating'] >= 40, 'tarate'] = 'pos'
df.loc[df['rating'] == 30, 'tarate'] = 'neut'
df.loc[df['rating'] <= 20, 'tarate'] = 'neg'

(Keep the labels lowercase: you later compare this column against df['score'], which holds lowercase 'pos'/'neg', and 'Pos' != 'pos'.)

Note that this will leave NaN (not a number) in the column for ratings strictly between 20 and 30, and for ratings strictly between 30 and 40.
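If you want every row labelled with no NaN left over, make the branches exhaustive. One way is numpy.select, which tries the conditions in order and falls back to a default. This is a sketch with made-up ratings, mirroring the logic of the list comprehension shown further down (anything neither >= 40 nor exactly 30 becomes 'neg'):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'rating': [12, 42, 40, 30, 31]})

# Conditions are checked in order; the default catches everything else,
# so no row is left as NaN
conditions = [df['rating'] >= 40, df['rating'] == 30]
choices = ['pos', 'neut']
df['tarate'] = np.select(conditions, choices, default='neg')
print(df['tarate'].tolist())  # ['neg', 'pos', 'pos', 'neut', 'neg']
```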

This part isn't right, though the compound lambda is not the real culprit:

#use polarity_scores() to get sentiment metrics and write them to new column "scores"
df.head['scores'] == df['review'].apply(lambda review: sid.polarity_scores['review'])

There are several problems here: df.head is a method, so it can't be indexed with ['scores']; == compares instead of assigning; sid is never defined (you named your analyzer object analyzer); and polarity_scores['review'] subscripts the method with the string 'review' instead of calling it on the review text. Corrected, it would be:

df['scores'] = df['review'].apply(lambda review: analyzer.polarity_scores(review))

Once "scores" actually holds the dicts returned by polarity_scores(), your next line

#extract the compound value of the polarity scores and write to new column "compound"
df['compound'] = df['scores'].apply(lambda d:d['compound'])

works as intended: each cell is a dict, and the lambda looks up its 'compound' key.
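For what it's worth, polarity_scores() returns a plain Python dict per review, so indexing by key inside a lambda does work once the "scores" column really holds those dicts. A small self-contained sketch with stand-in dicts (so it runs without nltk):

```python
import pandas as pd

# Stand-in values shaped like what analyzer.polarity_scores(review) returns
df = pd.DataFrame({'scores': [
    {'neg': 0.0, 'neu': 0.5, 'pos': 0.5, 'compound': 0.71},
    {'neg': 0.4, 'neu': 0.6, 'pos': 0.0, 'compound': -0.52},
]})

# Each cell is a dict, so the lambda pulls out its 'compound' key
df['compound'] = df['scores'].apply(lambda d: d['compound'])
print(df['compound'].tolist())  # [0.71, -0.52]
```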

I recommend looking up list comprehensions, and googling "pandas apply method" and "pandas lambda examples" to get more familiar with them.

For a bit of example code:

import pandas as pd

#create a demo dataframe called 'df'
df = pd.DataFrame({'rating': [12, 42, 40, 30, 31, 56, 8, 88, 39, 79]})

This gives you a dataframe that looks like this (just one column called 'rating' with integer numbers in it):

   rating
0      12
1      42
2      40
3      30
4      31
5      56
6       8
7      88
8      39
9      79

Using that column to make another based on the values in it can be done like this...

#create a new column called 'tarate' and using a list comprehension
#assign a string value of either 'pos', 'neut', or 'neg' based on the 
#numeric value in the 'rating' column (it does this going row by row)
df['tarate'] = ['pos' if x >= 40 else 'neut' if x == 30 else 'neg' for x in df['rating']]

#output the df
print(df)

Outputs:

   rating tarate
0      12    neg
1      42    pos
2      40    pos
3      30   neut
4      31    neg
5      56    pos
6       8    neg
7      88    pos
8      39    neg
9      79    pos
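To round off the remaining steps (the pos/neg classification of "compound" and the sklearn.metrics comparison), here is a sketch with hand-made compound values standing in for the VADER output; the mechanics are identical with real scores. Note that with only two predicted labels, a TripAdvisor 'neut' row can never be matched exactly:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report

# Hand-made data: in the real script, 'compound' comes from
# analyzer.polarity_scores(review)['compound']
df = pd.DataFrame({
    'tarate':   ['neg', 'pos', 'pos', 'neut', 'neg'],
    'compound': [-0.6,   0.8,   0.5,   0.0,   -0.2],
})

# Classify each compound score as pos (>= 0) or neg (< 0)
df['score'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')

print(accuracy_score(df['tarate'], df['score']))  # 0.8
print(classification_report(df['tarate'], df['score'], zero_division=0))
```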
