How to use my own lexicon dictionary to analyse sentences in R?
I have built a new lexicon dictionary to analyse the sentiment of sentences in R. I have used lexicon dictionaries before using R, but I am unsure how to use my own. I managed to create positive and negative word lists and count the number of positive and negative words in a sentence, then take the sum. This does not take into account the scores allocated to each word, as shown in the example below.
I would like to analyse, say, this sentence: "I am happy and kind of sad". Example list of words and scores (the real list would be bigger than this):
happy, 1.3455
sad, -1.0552
I would like to match these words against the sentence and take the sum of their scores, 1.3455 + (-1.0552), which in this case gives an overall score of 0.2903.
How would I go about using the actual score for each word to provide an overall score when analysing the sentiment of each sentence in R, as in the example above?
Many thanks, James
You can start with the magnificent tidytext package:
library(tidytext)
library(tidyverse)
First, the data to analyze, with a small transformation to give each sentence an ID:
# data (tibble() replaces the deprecated data_frame())
df <- tibble(text = c('I am happy and kind of sad', 'sad is sad, happy is good'))
# add an ID column
df <- tibble::rowid_to_column(df, "ID")
# rename the ID column to "line"
colnames(df)[1] <- "line"
> df
# A tibble: 2 x 2
line text
<int> <chr>
1 1 I am happy and kind of sad
2 2 sad is sad, happy is good
Next, split each sentence into words, one word per row. unnest_tokens() does this for every sentence (each line ID), much like a loop would, and it also lowercases the words and drops punctuation:
tidy <- df %>% unnest_tokens(word, text)
> tidy
# A tibble: 13 x 2
line word
<int> <chr>
1 1 i
2 1 am
3 1 happy
4 1 and
5 1 kind
6 1 of
7 1 sad
8 2 sad
9 2 is
10 2 sad
# ... with 3 more rows
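To make the tokenization step less of a black box, here is a rough base-R approximation of what it produces. This is only a sketch: it is not how tidytext implements unnest_tokens(), the name tidy_base is made up for illustration, and the punctuation handling is cruder (it also strips apostrophes, for example).

```r
# The two example sentences from above
text <- c("I am happy and kind of sad", "sad is sad, happy is good")

# Lowercase, drop punctuation, then split on whitespace
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(text)), "\\s+")

# One row per word, remembering which sentence each word came from
tidy_base <- data.frame(
  line = rep(seq_along(tokens), lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)

print(tidy_base)
```

For real text, unnest_tokens() is the safer choice, since it handles edge cases that this sketch ignores.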
Now your brand new lexicon:
lexicon <- tibble(word = c('happy', 'sad'), scores = c(1.3455, -1.0552))
> lexicon
# A tibble: 2 x 2
word scores
<chr> <dbl>
1 happy 1.35
2 sad -1.06
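Since the real lexicon is larger than two words, it would probably live in a file rather than being typed in by hand. A minimal sketch, assuming the file has one "word, score" pair per line and no header, as in the question. The inline string lexicon_csv stands in for a real file here, and lexicon_from_csv is a hypothetical name:

```r
# Stand-in for the contents of a lexicon file (assumed layout: word, score)
lexicon_csv <- "happy, 1.3455
sad, -1.0552"

# read.csv() accepts text = ... in place of a file path;
# strip.white = TRUE removes the space after each comma
lexicon_from_csv <- read.csv(
  text = lexicon_csv,
  header = FALSE,
  col.names = c("word", "scores"),
  strip.white = TRUE,
  stringsAsFactors = FALSE
)

print(lexicon_from_csv)
```

With an actual file, the text argument would be replaced by the path to that file; the column names match the lexicon used in the rest of the answer.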
Next, merge the tokenized data with the lexicon so that each matched word carries its score. merge() performs an inner join by default, so words that are not in the lexicon are dropped:
merged <- merge(tidy,lexicon, by = 'word')
Now sum the scores for each phrase to get its sentiment:
scoredf <- aggregate(cbind(scores) ~line, data = merged, sum)
>scoredf
line scores
1 1 0.2903
2 2 -0.7649
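The tokenize, match, and sum steps can also be chained into a single dplyr pipeline. This is a sketch of an equivalent formulation rather than a replacement for the code above; it rebuilds the example data so that it runs on its own, and scoredf_dplyr is a name introduced only for illustration.

```r
library(dplyr)
library(tidytext)

# Same example sentences and lexicon as above
df <- tibble(
  line = 1:2,
  text = c("I am happy and kind of sad", "sad is sad, happy is good")
)
lexicon <- tibble(word = c("happy", "sad"), scores = c(1.3455, -1.0552))

scoredf_dplyr <- df %>%
  unnest_tokens(word, text) %>%       # one row per word
  inner_join(lexicon, by = "word") %>% # keep only words found in the lexicon
  group_by(line) %>%
  summarise(scores = sum(scores))      # total score per sentence

print(scoredf_dplyr)
```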
Finally, merge the initial df with the scores to have phrases and scores together:
merge(df,scoredf, by ='line')
line text scores
1 1 I am happy and kind of sad 0.2903
2 2 sad is sad, happy is good -0.7649
The same steps work when you want overall sentiment scores for multiple phrases at once.
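One thing to watch with multiple phrases: a sentence with no lexicon words never appears in the scores table, so a plain merge() drops it from the final result. A minimal sketch of one way to keep such sentences, assuming a score of 0 is an acceptable default for them. The third sentence is a made-up example, and the scores are the ones computed above:

```r
# Sentences, including a hypothetical one with no lexicon words
df <- data.frame(
  line = 1:3,
  text = c("I am happy and kind of sad",
           "sad is sad, happy is good",
           "nothing to score here"),
  stringsAsFactors = FALSE
)

# Per-sentence scores as computed earlier (line 3 has no matches)
scoredf <- data.frame(line = 1:2, scores = c(0.2903, -0.7649))

# all.x = TRUE keeps every sentence; unmatched ones get NA
result <- merge(df, scoredf, by = "line", all.x = TRUE)

# Assumption: treat "no lexicon words" as a neutral score of 0
result$scores[is.na(result$scores)] <- 0

print(result)
```

Whether 0 or NA is the better value depends on whether "no sentiment words found" should count as neutral in your analysis.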
Hope it helps!