

How to import an emotion lexicon into R for data scraping of Kickstarter

I am attempting to create a model to measure emotion in text using R. Basically, using a lexicon of emotion words, I want to extract only the 'p' (paragraph) field from a large number of URLs. I am looking to find the word count per emotion per URL, based on the presence of pre-defined emotion-indicating words from the lexicon. Lexicon link

The data I use is in JSON format, from Webrobots: Dataset Link (the latest set).

Any help would be much appreciated, as I am really desperate to get started on this! Even just knowing how I could import this into R, and code to count the words, would be of great help.

Kind regards, a desperate R-illiterate girl.

Update: the data file is now imported into R. However, I cannot find a way to write code that tests the data for the presence of the lexicon-indicated words. I want to create 6 new variables, one per basic emotion (happy, sad, anger, surprise, fear, disgust), holding the word count for that emotion in each campaign.

On closer look, the file already contains the paragraph 'p' part I indicated; I just need to categorize its contents.

Lexicon list download

  1. The first step is to manually download (a simple copy and paste) the lexicon list from this link and save it in .csv format:

http://www.saifmohammad.com/WebDocs/NRC-AffectIntensity-Lexicon.txt

Then you need to break this list down into 4 separate parts, each covering one affect. This will result in 4 .csv files (an R sketch for doing the split programmatically is given after the file list):

anger_list = w.csv
fear_list  = x.csv
joy_list   = y.csv
sad_list   = z.csv
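
If you prefer to do the split inside R rather than by hand, a minimal sketch is below. It assumes that, after removing the descriptive header, the downloaded .txt file is tab-separated with the columns term, score and AffectDimension; check the file and adjust the column names if they differ.

# Sketch: split the NRC Affect Intensity Lexicon into one CSV per affect.
# Assumes a tab-separated file with columns term, score, AffectDimension.
lex <- read.table("NRC-AffectIntensity-Lexicon.txt",
                  header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# Write one CSV of terms per affect (anger, fear, joy, sadness)
for (aff in unique(lex$AffectDimension)) {
  write.csv(lex[lex$AffectDimension == aff, "term", drop = FALSE],
            file = paste0(aff, "_list.csv"), row.names = FALSE)
}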

If you do not want to do this manually, there is an alternative lexicon list where data is directly downloadable into separate files: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon

Text data download

  1. The other link you shared (http://webrobots.io/Kickstarter-datasets/) now seems to have both JSON and CSV files, and reading either into R is quite straightforward.
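
For completeness, a minimal sketch of reading the dump into R, assuming the jsonlite package and a file named Kickstarter.json; if the JSON dump is newline-delimited (as the Webrobots dumps usually are), stream_in handles it. Adjust the file name to whatever you downloaded.

library(jsonlite)

# Read the newline-delimited JSON dump into a data frame (one row per record)
kick <- stream_in(file("Kickstarter.json"))

# Alternatively, the CSV version can be read directly:
# kick <- read.csv("Kickstarter.csv", stringsAsFactors = FALSE)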

Cleaning of URLs for text extraction

  1. I am not sure which column/field you are interested in analysing, as the data set I downloaded as of February 2019 does not have a field 'p'.

Since you mentioned the presence of URLs, I am also sharing brief code for editing or cleaning the text extracted from URLs. This will help you get clean textual data out of them:

# stopwords() below is provided by the tm package
# (the standalone 'stopwords' package also offers an equivalent function)
library(tm)

replacePunctuation <- function(x)
{

  # Lowercase all words for convenience
  x <- tolower(x)

  # Remove words with multiple consecutive digits in them (3 in this case) 
  x <- gsub("[a-zA-Z]*([0-9]{3,})[a-zA-Z0-9]* ?", " ", x)

  # Remove extra punctuation
  x <- gsub("[.]+[ ]"," ",x) # full stop
  x <- gsub("[:]+[ ]"," ",x) # Colon
  x <- gsub("[?]"," ",x)     # Question Marks
  x <- gsub("[!]"," ",x)     # Exclamation Marks
  x <- gsub("[;]"," ",x)     # Semi colon
  x <- gsub("[,]"," ",x)     # Comma
  x <- gsub("[']"," ",x)     # Apostrophe
  x <- gsub("[-]"," ",x)     # Hyphen
  x <- gsub("[#]"," ",x)     

  # Remove all newline characters
  x <- gsub("[\r\n]", " ", x)

  # Regex pattern for removing stop words
  stop_pattern <- paste0("\\b(", paste0(stopwords("en"), collapse="|"), ")\\b")
  x <- gsub(stop_pattern, " ", x)

  # Replace whitespace longer than 1 space with a single space
  x <- gsub(" {2,}", " ", x)

  x
}
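
As a usage example, the cleaner can be applied to a whole character vector at once, since all the gsub calls are vectorised; assuming the text of interest ends up in a column df$p, as in the next step:

# Hypothetical usage: clean the raw text column in place
df$p <- replacePunctuation(as.character(df$p))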

Code for adding scores on sentiment or affect

  1. Next, I assume you have read your data as text in R. Let's say you have stored it in a data frame column df$p. The next step is to add additional columns to this data frame:

     df$p # contains text of interest 

Now add additional columns to this data frame, one for each of the four affects:

df$ANGER   = 0
df$FEAR    = 0
df$JOY     = 0
df$SADNESS = 0

Then you simply loop through each row of df, splitting the text p into words on whitespace. For each word you check whether it occurs in one of your lexicon lists, and you increment the count for the corresponding affect as below:

for (i in 1:nrow(df))
{
  # counter initialization
  angry = 0
  feared = 0
  joyful = 0
  sad = 0

  # for df, let's say the text 'p' is in the first column
  words <- strsplit(as.character(df[i, 1]), " ")[[1]]
  for (j in 1:length(words))
  {
    if (words[j] %in% anger_list[,1])
      angry = angry + 1
    else if (words[j] %in% fear_list[,1])
      feared = feared + 1
    else if (words[j] %in% joy_list[,1])
      joyful = joyful + 1
    else if (words[j] %in% sad_list[,1])  # only count words actually in the sadness list
      sad = sad + 1
  } #for 2

  # write the counts back into the affect columns added above
  df$ANGER[i]   <- angry
  df$FEAR[i]    <- feared
  df$JOY[i]     <- joyful
  df$SADNESS[i] <- sad

} #for 1

Please note that in the above implementation I'm assuming a word can only represent one affect at a time, i.e. that the affects are mutually exclusive. However, for some of the terms in your text 'p' this might not be true, so you may want to modify the code to allow multiple affects per term.
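
If you want to drop that mutual-exclusivity assumption, one option is to check every word against all four lists independently, so a single term can add to more than one counter. A minimal sketch of a replacement for the inner word loop, assuming the same words vector and lexicon objects as above:

# Count each affect independently: one word may contribute to several counters
angry   <- sum(words %in% anger_list[,1])
feared  <- sum(words %in% fear_list[,1])
joyful  <- sum(words %in% joy_list[,1])
sad     <- sum(words %in% sad_list[,1])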
