R tm Package: How to compare text to positive reference word list and return count of positive word occurrences

What is the best approach for using the tm library to compare text against a reference list of positive words and return the count of positive-word occurrences? I want to be able to return the total number of positive words found in the reference text.

Question: What is the best way to do this?

For example:

positiveword_list <- c("happy", "great", "fabulous", "great")

Reference text:

exampleText <- c("ON A BRIGHT SPRING DAY in the year 1677, “the good ship 
Kent,” Captain Gregory Marlowe, Master, set sail from the great docks of London. She carried 230 English Quakers, outward bound for a new home in British North America. As the ship dropped down the Thames she was hailed by King Charles II, who happened to be sailing on the river. The two vessels made a striking contrast. The King’s yacht was sleek and proud in gleaming paintwork, with small cannons peeping through wreaths of gold leaf, a wooden unicorn prancing high above her prow, and the royal arms emblazoned upon her stern. She seemed to dance upon the water— new sails shining white in the sun, flags streaming bravely from her mastheads, officers in brilliant uniform, ladies in court costume, servants in livery, musicians playing, and spaniels yapping. At the center of attention was the saturnine figure of the King himself in all his regal splendor. On the other side of the river came the emigrant ship. She would have been bluff-bowed and round-sided, with dirty sails and a salt-stained hull, and a single ensign drooping from its halyard. Her bulwarks were lined with apprehensive passengers— some dressed in the rough gray homespun of the northern Pen-nines, others in the brown drab of London tradesmen, several in the blue suits of servant-apprentices, and a few in the tattered motley of the country poor.")

Here is some background:

What I am trying to do is count the number of positive words and store the count in a data frame as a new column.

count <- length(which(lapply(positiveword_list, grepl, x = exampleText) == TRUE))

Thus:

dataFrameIn %>% mutate(posCount = length(which(lapply(positiveword_list, grepl, x = text) == TRUE)))

where text is a column in dataFrameIn (i.e. dataFrameIn$text).

You can do this without using the tm package.

Try this:

contained <- lapply(positiveword_list, grepl, x = exampleText)

lapply returns a list, here with one logical element per word in positiveword_list.

Words present:

>positiveword_list[contained == T]
"great" "great"
>length(contained[contained==T])
2

Words not present:

>positiveword_list[contained == F]
"happy"    "fabulous"
>length(contained[contained==F])
2
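
Note that grepl only reports whether each word appears somewhere in the text (and it matches substrings, so "great" would also hit "greater"), and the duplicated "great" in the list is counted twice. If what you want is the number of occurrences per text, stored as a new column, one possible base R sketch is below. The helper name count_positive and the three sample rows are made up for illustration; the tokenization rule (split on anything that is not a letter or apostrophe) is an assumption you may want to adjust.

```r
# Hypothetical helper: for each text, count the tokens that appear in the word list.
count_positive <- function(texts, words) {
  words <- unique(tolower(words))  # drop duplicates such as the second "great"
  vapply(texts, function(txt) {
    # Split on runs of anything that is not a letter or apostrophe, then lowercase
    tokens <- tolower(unlist(strsplit(txt, "[^[:alpha:]']+")))
    sum(tokens %in% words)         # every occurrence counts, whole words only
  }, integer(1), USE.NAMES = FALSE)
}

positiveword_list <- c("happy", "great", "fabulous", "great")

# Stand-in data frame; these rows are invented for the example
dataFrameIn <- data.frame(
  text = c("set sail from the great docks of London",
           "A happy crew and a great, great voyage",
           "dirty sails and a salt-stained hull"),
  stringsAsFactors = FALSE
)

dataFrameIn$posCount <- count_positive(dataFrameIn$text, positiveword_list)
dataFrameIn$posCount
# 1 3 0
```

Because the helper is vectorised over texts, it should also work unchanged inside dplyr, e.g. mutate(posCount = count_positive(text, positiveword_list)).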

Here's another method using a purpose-built tool: define a dictionary of positive words and apply it to any number of texts to count the positive keywords. This uses the quanteda package and its dfm() function, with the dictionary = argument, to create a document-feature matrix. (See ?dictionary.)

require(quanteda)
posDic <- dictionary(list(positive = positiveword_list))
myDfm <- dfm(exampleText, dictionary = posDic)
# Creating a dfm from a character vector ...
# ... lowercasing
# ... tokenizing
# ... indexing documents: 1 document
# ... indexing features: 157 feature types
# ... applying a dictionary consisting of 1 key
# ... created a 1 x 1 sparse dfm
# ... complete. 
# Elapsed time: 0.014 seconds.

as.data.frame(myDfm)
#       positive
# text1        1

# produces a data frame with the text and the positive count
cbind(text = exampleText, as.data.frame(myDfm))
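
As far as I know, more recent quanteda releases (version 3 onward) no longer take a character vector or a dictionary = argument in dfm(), so the call above may fail there. A rough equivalent, assuming a current quanteda install, is to tokenize first and then look the tokens up in the dictionary. The shortened exampleText here is a stand-in for the full passage, and remove_punct = TRUE is my choice, not part of the original answer.

```r
library(quanteda)

positiveword_list <- c("happy", "great", "fabulous", "great")
# Shortened stand-in for the full passage in the question
exampleText <- c(text1 = "set sail from the great docks of London")

posDic <- dictionary(list(positive = unique(positiveword_list)))

# Tokenize, map matching tokens onto the dictionary key, then tabulate
toks <- tokens(exampleText, remove_punct = TRUE)
posDfm <- dfm(tokens_lookup(toks, dictionary = posDic))

# One row per document, one column per dictionary key
posCounts <- convert(posDfm, to = "data.frame")
posCounts
```

The resulting data frame has a doc_id column plus one count column per dictionary key, so it can be joined back to the original texts by document name.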

Note: This is probably not important to the example, but "great" in exampleText is not used as a positive word ("the great docks of London" refers to size). It illustrates the perils of polysemy when relying on dictionaries.
