R tm Package: How to compare text to positive reference word list and return count of positive word occurrences

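What is the best way to use the tm library to compare a text against a list of positive reference words and return the number of positive word occurrences? I want to be able to return the total count of positive words in the reference text.

Question: What is the best way to do this?

For example: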

positiveword_list <- c("happy", "great", "fabulous", "great")

Reference text:

exampleText <- c("ON A BRIGHT SPRING DAY in the year 1677, “the good ship 
Kent,” Captain Gregory Marlowe, Master, set sail from the great docks of London. She carried 230 English Quakers, outward bound for a new home in British North America. As the ship dropped down the Thames she was hailed by King Charles II, who happened to be sailing on the river. The two vessels made a striking contrast. The King’s yacht was sleek and proud in gleaming paintwork, with small cannons peeping through wreaths of gold leaf, a wooden unicorn prancing high above her prow, and the royal arms emblazoned upon her stern. She seemed to dance upon the water— new sails shining white in the sun, flags streaming bravely from her mastheads, officers in brilliant uniform, ladies in court costume, servants in livery, musicians playing, and spaniels yapping. At the center of attention was the saturnine figure of the King himself in all his regal splendor. On the other side of the river came the emigrant ship. She would have been bluff-bowed and round-sided, with dirty sails and a salt-stained hull, and a single ensign drooping from its halyard. Her bulwarks were lined with apprehensive passengers— some dressed in the rough gray homespun of the northern Pen-nines, others in the brown drab of London tradesmen, several in the blue suits of servant-apprentices, and a few in the tattered motley of the country poor.")

Here is some background:

What I want to do is count the number of positive words and store the count as a new column in a data frame.

count <- length(which(lapply(positiveword_list, grepl, x = exampleText) == TRUE))

So:

dataFrameIn %>% mutate(posCount = length(which(lapply(positiveword_list, grepl, x = text) == TRUE)))

where text is a column in dataFrameIn (i.e. dataFrameIn$text).
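With more than one row, grepl returns one value per row for each word, so the count has to be computed row by row. The following is a minimal base R sketch under stated assumptions: the three rows of dataFrameIn are made up for illustration, count_positive_present is a hypothetical helper name, the word list is deduplicated (so the repeated "great" counts once), and matching is a case-insensitive substring match (so "greatest" would also count as "great").

```r
# Hypothetical data: the column name `text` follows the question; the rows are invented.
positiveword_list <- c("happy", "great", "fabulous", "great")
dataFrameIn <- data.frame(
  text = c("A great day on the great docks",
           "Nothing to see here",
           "Happy and fabulous"),
  stringsAsFactors = FALSE
)

# For a single text, count how many distinct list words appear in it.
# Assumption: substring matching, case-insensitive, duplicates in the list ignored.
count_positive_present <- function(txt, words) {
  words <- unique(tolower(words))
  sum(vapply(words, grepl, logical(1), x = tolower(txt), fixed = TRUE))
}

# Apply the helper to every row and store the result as a new column.
dataFrameIn$posCount <- vapply(
  dataFrameIn$text, count_positive_present, integer(1),
  words = positiveword_list, USE.NAMES = FALSE
)

print(dataFrameIn)
```

Dropping the unique() call would make a repeated list entry count twice, which is what the one-liner in the question does for a single text.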

You can do this without the tm package.

Try this:

contained <- lapply(positiveword_list, grepl, x = exampleText)

lapply returns a list.
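If a plain logical vector is more convenient than a list, vapply can stand in for lapply. This is a sketch rather than part of the original answer; sampleText is a shortened stand-in for exampleText and contained_vec is a hypothetical name.

```r
positiveword_list <- c("happy", "great", "fabulous", "great")
sampleText <- "set sail from the great docks of London"  # shortened stand-in text

# vapply returns a logical vector (one element per list word) instead of a list.
contained_vec <- vapply(positiveword_list, grepl, logical(1),
                        x = sampleText, USE.NAMES = FALSE)

present_words <- positiveword_list[contained_vec]   # list words found in the text
n_present     <- sum(contained_vec)                 # how many list entries matched
n_absent      <- sum(!contained_vec)                # how many did not
```

Because the result is an ordinary logical vector, it can index positiveword_list directly and be summed without comparing against TRUE.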

Words that occur:

> positiveword_list[contained == T]
[1] "great" "great"
> length(contained[contained == T])
[1] 2

Words that do not occur:

> positiveword_list[contained == F]
[1] "happy"    "fabulous"
> length(contained[contained == F])
[1] 2
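Note that these counts say how many list entries are present in the text, not how many times positive words occur, and the duplicated "great" in the list is counted twice. If the goal is the total number of occurrences, one possible base R sketch is to tokenize the text and tabulate the tokens. The sample sentence and the simple tokenization rule (split on anything that is not a letter or apostrophe) are assumptions made for illustration.

```r
positiveword_list <- c("happy", "great", "fabulous", "great")
# Invented sample text, not the passage from the question.
sampleText <- "The great ship left the great docks; nobody aboard was happy, and Greatness was far off."

# Lowercase, then split into word tokens on any run of non-letter characters.
tokens <- strsplit(tolower(sampleText), "[^a-z']+")[[1]]
tokens <- tokens[nzchar(tokens)]

# Count whole-word matches per list word, ignoring duplicates in the list.
words <- unique(positiveword_list)
hits  <- table(factor(tokens[tokens %in% words], levels = words))
total <- sum(hits)

print(hits)
print(total)
```

Whole-token matching means "greatness" is not counted as "great", unlike the substring match that grepl performs.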

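Here is another approach using a purpose-built tool, where you define a dictionary of positive words and apply it to any number of texts to count the positive keywords. This uses the quanteda package and the dfm() method to create a document-feature matrix, with the dictionary = argument. (See ?dictionary.)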

require(quanteda)
posDic <- dictionary(list(positive = positiveword_list))
myDfm <- dfm(exampleText, dictionary = posDic)
# Creating a dfm from a character vector ...
# ... lowercasing
# ... tokenizing
# ... indexing documents: 1 document
# ... indexing features: 157 feature types
# ... applying a dictionary consisting of 1 key
# ... created a 1 x 1 sparse dfm
# ... complete. 
# Elapsed time: 0.014 seconds.

as.data.frame(myDfm)
#       positive
# text1        1

# produces a data frame with the text and the positive count
cbind(text = exampleText, as.data.frame(myDfm))

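Note: this probably does not matter for the example, but "great" as used in exampleText is not a positive word. That illustrates the danger of polysemy when working with dictionaries.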
