简体   繁体   中英

How to import a lexicon in XML-LMF format for sentiment analysis in R

I'm trying to import the following lexicon in R, to be used with text mining packages such as quanteda , or to export it as a list or data frame:

https://github.com/opener-project/VU-sentiment-lexicon/tree/master/VUSentimentLexicon/IT-lexicon

The format is XML-LMF. I could not find any way to parse such a format with R.

(see https://en.wikipedia.org/wiki/Lexical_Markup_Framework )

As a workaround I tried to use the XML package, but the structure is a bit different from usual XML, and I did not manage to parse all the nodes.

I managed to make it work using the xml2 package. Here's my code:

library(xml2)
library(quanteda)

# Read file and find the nodes
opeNER_xml <- read_xml("it-sentiment_lexicon.lmf.xml")
entries <- xml_find_all(opeNER_xml, ".//LexicalEntry")
lemmas <- xml_find_all(opeNER_xml, ".//Lemma")
confidence <- xml_find_all(opeNER_xml, ".//Confidence")
sentiment <- xml_find_all(opeNER_xml, ".//Sentiment")

# Parse and put in a data frame
opeNER_df <- data.frame(
  id = xml_attr(entries, "id"),
  lemma = xml_attr(lemmas, "writtenForm"),
  partOfSpeech = xml_attr(entries, "partOfSpeech"),
  confidenceScore = as.numeric(xml_attr(confidence, "score")),
  method = xml_attr(confidence, "method"),
  polarity = as.character(xml_attr(sentiment, "polarity")),
  stringsAsFactors = F
)
# Fix a mistake
opeNER_df$polarity <- ifelse(opeNER_df$polarity == "nneutral", 
                             "neutral", opeNER_df$polarity)

# Make quanteda dictionary
opeNER_dict <- quanteda::dictionary(with(opeNER_df, split(lemma, polarity)))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM