
Python VADER lexicon structure for sentiment analysis

I am using the VADER sentiment lexicon from Python's nltk library to analyze text sentiment. This lexicon does not suit my domain well, so I wanted to add my own sentiment scores for various words. To do that, I got my hands on the lexicon text file (vader_lexicon.txt). However, I do not understand the structure of this file well. For example, a word like obliterate has the following entry in the text file:

    obliterate	-2.9	0.83066	[-3, -4, -3, -3, -3, -3, -2, -1, -4, -3]

Clearly the -2.9 is the average of sentiment scores in the list. But what does the 0.83066 represent?

Thanks!

According to the VADER source code, only the first two tab-separated fields on each line are used. The rest of the line is ignored:

for line in self.lexicon_full_filepath.split('\n'):
    (word, measure) = line.strip().split('\t')[0:2]  # only the first two columns are read
    lex_dict[word] = float(measure)

The vader_lexicon.txt file has four tab-delimited columns:

  1. Column 1: the token (word)
  2. Column 2: the mean of the human sentiment ratings
  3. Column 3: the standard deviation of those ratings (assuming they follow a normal distribution)
  4. Column 4: the list of 10 individual human ratings collected during the annotation experiments
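So the 0.83066 you asked about is the standard deviation of the ten ratings in column 4. A quick check with the obliterate entry confirms this (the file's value matches the population standard deviation, not the sample one):

```python
import statistics

# The ten human ratings for "obliterate" from column 4 of vader_lexicon.txt
ratings = [-3, -4, -3, -3, -3, -3, -2, -1, -4, -3]

mean = statistics.mean(ratings)    # column 2
std = statistics.pstdev(ratings)   # column 3 (population standard deviation)

print(mean)           # -2.9
print(round(std, 5))  # 0.83066
```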

The sentiment calculation never reads the 3rd and 4th columns. So if you want to extend the lexicon for your domain, only the token and its mean score matter; the last two columns can be left out or filled with placeholder values.
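To illustrate, here is a small sketch that builds custom lexicon lines and parses them with the same `split('\t')[0:2]` logic quoted from the VADER source above. The word "churn" and its score are made-up examples for a hypothetical domain; note that an entry with only two columns parses just as well as a full four-column line:

```python
# Two lexicon lines: a custom two-column entry and a full four-column one.
custom_lines = [
    "churn\t-1.5",  # std dev and ratings list omitted entirely
    "obliterate\t-2.9\t0.83066\t[-3, -4, -3, -3, -3, -3, -2, -1, -4, -3]",
]

lex_dict = {}
for line in "\n".join(custom_lines).split("\n"):
    # Same logic as VADER's lexicon loader: keep only token and mean score
    (word, measure) = line.strip().split("\t")[0:2]
    lex_dict[word] = float(measure)

print(lex_dict)  # {'churn': -1.5, 'obliterate': -2.9}
```

Alternatively, rather than editing the file, you can update the loaded lexicon at runtime: `SentimentIntensityAnalyzer` exposes the parsed dictionary as its `lexicon` attribute, so `analyzer.lexicon.update({"churn": -1.5})` takes effect immediately.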
