
Writing Vader sentiment analysis to new column in csv

I have a scraped csv file of TripAdvisor reviews. There are five columns:

person, title, rating, review, review_date.

I want this code to do the following:

  1. In the csv, create a new column called "tarate".
  2. Populate "tarate" with 'pos', 'neg', or 'neut'. It should read the numeric values in "rating". "tarate" == 'pos' if "rating" >= 40; "tarate" == 'neut' if "rating" == 30; "tarate" == 'neg' if "rating" < 30.
  3. Next, run the "review" column through SentimentIntensityAnalyzer.
  4. Record the output in a new csv column called "scores".
  5. Create a separate csv column for the "compound" values, using a 'pos' and 'neg' classification.
  6. Run the sklearn.metrics tool to compare the TripAdvisor ratings ("tarate") to "compound". This can just print.

Part of the code is based on [http://akashsenta.com/blog/sentiment-analysis-with-vader-with-python/]

Here is my csv file: [https://github.com/nsusmann/vadersentiment]

I am getting some errors. I am a beginner and I think I am getting tripped up on things like pointing to specific columns and also the lambda function.

Here is the code:

# open command prompt
# import nltk
# nltk.download()
# pip3 install pandas
# pip3 installs sci-kitlearn
# pip3 install matplotlib
# pip3 install seaborn
# pip3 install vaderSentiment
#pip3 install openpyxl

import pandas
import nltk
nltk.download([
    "vader_lexicon",
    "stopwords"])
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.sentiment.util import *
from nltk import tokenize
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
from collections import Counter
import re
import math
import html
import sklearn
import sklearn.metrics as metrics
from sklearn.metrics import mutual_info_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
import openpyxl

# open the file to save the review
import csv
outputfile = open('D:\Documents\Archaeology\Projects\Patmos\TextAnalysis\Sentiment\scraped_cln_sent.csv', 'w', newline='')
df = csv.writer(outputfile)

#open Vader Sentiment Analyzer 
from nltk.sentiment.vader import SentimentIntensityAnalyzer

#make SIA into an object
analyzer = SentimentIntensityAnalyzer()

#create a new column called "tarate"
df['tarate'],
#populate column "tarate". write pos, neut, or neg per row based on column "rating" value
df.loc[df['rating'] >= 40, ['tarate'] == 'Pos',
df.loc[df['rating'] == 30, ['tarate'] == 'Neut',
df.loc[df['rating'] <= 20, ['tarate'] == 'Neg', 

#use polarity_scores() to get sentiment metrics and write them to new column "scores"
df.head['scores'] == df['review'].apply(lambda review: sid.polarity_scores['review'])

#extract the compound value of the polarity scores and write to new column "compound"
df['compound'] = df['scores'].apply(lambda d:d['compound'])

#using column "compound", determine whether the score is <0> and write new column "score" recording positive or negative
df['score'] = df['compound'].apply(lambda score: 'pos' if score >=0 else 'neg')
ta.file()
                                           
#get accuracy metrics. this will compare the trip advisor rating (text version recorded in column "tarate") to the sentiment analysis results in column "score"
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix                                           
accuracy_score(df['tarate'],df['score'])

print(classification_report(df['tarate'],df['score']))

You don't need to create the new column before filling it. Also, you have spurious commas at the ends of lines. Do not do that; a comma at the end of an expression in Python turns it into a tuple. Also remember that = is the assignment operator, and == is a comparison.

The pandas "loc" indexer takes a row indexer and a column indexer:
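To make the comma and `=` / `==` points concrete, here is a minimal sketch in plain Python. The variable names are made up for illustration and do not come from the question's code.

```python
# A trailing comma turns the right-hand side into a one-element tuple.
with_comma = 40,
without_comma = 40

# `=` assigns a value; `==` only compares and evaluates to True or False.
label = "Pos"
is_pos = (label == "Pos")

print(type(with_comma).__name__, type(without_comma).__name__, is_pos)
```

This is why a line such as `df['tarate'],` does not create anything: it just builds a tuple and throws it away.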

#populate column "tarate". write pos, neut, or neg per row based on column "rating" value
df.loc[df['rating'] >= 40, 'tarate'] = 'Pos'
df.loc[df['rating'] == 30, 'tarate'] = 'Neut'
df.loc[df['rating'] <= 20, 'tarate'] = 'Neg'

Note that this will leave NaN (not a number) in the column for values between 20 and 30, and values between 30 and 40.

I can't tell what you are trying to do here, but this isn't right:
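A small sketch of that gap, using a made-up frame of ratings rather than the real csv. Whether the scraped file contains any in-between values at all is an assumption; if every rating is a multiple of 10, no row would be left unlabelled.

```python
import pandas as pd

# Made-up ratings; 31 and 39 fall in the ranges that none of the three rules cover.
df = pd.DataFrame({"rating": [12, 42, 40, 30, 31, 56, 8, 88, 39, 79]})

# Row indexer first (a boolean mask), column indexer second.
df.loc[df["rating"] >= 40, "tarate"] = "Pos"
df.loc[df["rating"] == 30, "tarate"] = "Neut"
df.loc[df["rating"] <= 20, "tarate"] = "Neg"

# Rows that matched none of the masks are left as NaN.
unlabelled = df[df["tarate"].isna()]
print(unlabelled)
```

Checking `isna()` like this after labelling is a cheap way to see whether any rows slipped through the rules.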

#extract the compound value of the polarity scores and write to new column "compound"
df['compound'] = df['scores'].apply(lambda d:d['compound'])

df['scores'] is not going to contain a column called "compound", which is what you're asking for in the lambda.

I recommend looking up list comprehensions, google "pandas apply method", and "pandas lambda examples" to get more familiar with them.

For a bit of example code:
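For comparison, here is a sketch of the one situation where that lambda does return something useful: when every cell of "scores" holds a dict, which is what `polarity_scores()` returns (keys 'neg', 'neu', 'pos', 'compound'). The dicts below are hand-written stand-ins with made-up numbers, not real VADER output. Note also that the question's own "scores" line uses `==` and `df.head[...]`, so as posted the column would never be created; that is a reading of the posted code, not something stated above.

```python
import pandas as pd

# Hand-written stand-ins for what analyzer.polarity_scores(text) would return.
df = pd.DataFrame({
    "scores": [
        {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.7},
        {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.5},
        {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
    ]
})

# Each cell is a dict, so d["compound"] is a key lookup, not a column lookup.
df["compound"] = df["scores"].apply(lambda d: d["compound"])

# Same rule as the question: zero and above counts as 'pos'.
df["score"] = df["compound"].apply(lambda c: "pos" if c >= 0 else "neg")
print(df[["compound", "score"]])
```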
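One thing the snippets here take for granted is that `df` is a pandas DataFrame. In the question, `df` is a `csv.writer` on a file opened with `'w'`, which truncates the file and offers no column access. A minimal sketch of the read, modify, write round trip follows; the two rows are invented, and an in-memory buffer stands in for the real file path.

```python
import io
import pandas as pd

# Stand-in for the scraped file; in practice pass the csv path to pd.read_csv instead.
raw = io.StringIO(
    "person,title,rating,review,review_date\n"
    "a,Great,50,Loved it,2020-01-01\n"
    "b,Bad,10,Terrible,2020-01-02\n"
)
df = pd.read_csv(raw)

# Add the new column to the DataFrame in memory.
df["tarate"] = ["pos" if r >= 40 else "neut" if r == 30 else "neg" for r in df["rating"]]

# Write everything back out; index=False keeps the row numbers out of the csv.
out = io.StringIO()
df.to_csv(out, index=False)
header = out.getvalue().splitlines()[0]
print(header)
```

Writing to a different file name than the one you read from is a sensible precaution while the script is still being debugged.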

import pandas as pd

#create a demo dataframe called 'df'
df = pd.DataFrame({'rating': [12, 42, 40, 30, 31, 56, 8, 88, 39, 79]})

This gives you a dataframe that looks like this (just one column called 'rating' with integer numbers in it):

   rating
0      12
1      42
2      40
3      30
4      31
5      56
6       8
7      88
8      39
9      79

Using that column to make another based on the values in it can be done like this...

#create a new column called 'tarate' using a list comprehension:
#assign a string value of either 'pos', 'neut', or 'neg' based on the
#numeric value in the 'rating' column (it does this going row by row)
df['tarate'] = ['pos' if x >= 40 else 'neut' if x == 30 else 'neg' for x in df['rating']]

#output the df
print(df)

Outputs:

   rating tarate
0      12    neg
1      42    pos
2      40    pos
3      30   neut
4      31    neg
5      56    pos
6       8    neg
7      88    pos
8      39    neg
9      79    pos
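Carrying the same idea through to the comparison step from the question, here is a rough sketch with scikit-learn. The "compound" values are invented rather than produced by VADER. It also shows one thing to watch for: "tarate" has three labels while "score" has only two, so every 'neut' row counts as a miss unless those rows are left out. Labels must also match in case ('Pos' and 'pos' are different strings).

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.DataFrame({
    "rating":   [50, 40, 30, 20, 10, 50],
    "compound": [0.8, -0.2, 0.1, -0.6, 0.3, 0.9],  # made-up sentiment scores
})
df["tarate"] = ["pos" if x >= 40 else "neut" if x == 30 else "neg" for x in df["rating"]]
df["score"] = ["pos" if c >= 0 else "neg" for c in df["compound"]]

# Accuracy over all rows: the 'neut' row can never match a two-label column.
acc_all = accuracy_score(df["tarate"], df["score"])

# Accuracy over the rows that have a 'pos' or 'neg' rating only.
two_class = df[df["tarate"] != "neut"]
acc_two = accuracy_score(two_class["tarate"], two_class["score"])

# Rows are the true labels, columns the predicted ones, in the order given.
cm = confusion_matrix(df["tarate"], df["score"], labels=["pos", "neut", "neg"])
print(acc_all, acc_two)
print(cm)
```

Whether to drop the 'neut' rows or to add a neutral band to the "compound" classification is a design choice that depends on what the comparison is meant to show.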
