How can I convert an R data frame with a single column into a corpus for tm such that each row is taken as a document?
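I want to use the findAssocs command of the tm package, but it only works when the corpus contains more than one document. What I have instead is a single-column data frame in which each row holds the text of one tweet. Is it possible to convert it into a corpus that takes each row as a new document? Right now I get a single document: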
VCorpus (documents: 1, metadata (corpus/indexed): 0/0)
TermDocumentMatrix (terms: 71, documents: 1)
I have 10 rows of data, and I would like them to be converted to:
VCorpus (documents: 10, metadata (corpus/indexed): 0/0)
TermDocumentMatrix (terms: 71, documents: 10)
I suggest you read the tm vignette before going any further. The answer to your specific question follows.
Create sample data:
txt <- strsplit("I wanted to use the findAssocs of the tm package. but it works only when there are more than one documents in the corpus. I have a data frame table which has one column and each row has a tweet text. Is it possible to convert the into a corpus which takes each row as a new document?", split=" ")[[1]]
data <- data.frame(text=txt, stringsAsFactors=FALSE)
data[1:5, ]
Import your data into a "Source", the "Source" into a "Corpus", and then create a TDM from the "Corpus":
library(tm)
tdm <- TermDocumentMatrix(Corpus(DataframeSource(data)))
show(tdm)
#A term-document matrix (35 terms, 58 documents)
#
#Non-/sparse entries: 43/1987
#Sparsity : 98%
#Maximal term length: 10
#Weighting : term frequency (tf)
str(tdm)
#List of 6
# $ i : int [1:43] 32 31 28 12 28 21 3 35 20 33 ...
# $ j : int [1:43] 2 4 5 6 8 10 11 13 14 15 ...
# $ v : num [1:43] 1 1 1 1 1 1 1 1 1 1 ...
# $ nrow : int 35
# $ ncol : int 58
# $ dimnames:List of 2
# ..$ Terms: chr [1:35] "and" "are" "but" "column" ...
# ..$ Docs : chr [1:58] "1" "2" "3" "4" ...
# - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
# - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
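The output above comes from an older tm release. As far as I know, newer releases expect the input to DataframeSource to have a doc_id column and a text column, so a plain single-column data frame may be rejected. A minimal sketch of an alternative, assuming the text lives in a column called text: pass that column to VectorSource, which treats each element of the character vector as its own document. The tweets data frame and the search term "cat" are made up for illustration.

```r
library(tm)

# Hypothetical stand-in for the single-column data frame of tweets
tweets <- data.frame(
  text = c("the cat sat on the mat",
           "the dog chased the cat",
           "dogs and cats make good pets"),
  stringsAsFactors = FALSE
)

# VectorSource turns each element of the character vector into one document
corpus <- VCorpus(VectorSource(tweets$text))
tdm <- TermDocumentMatrix(corpus)

# Should equal the number of rows in the data frame
nDocs(tdm)

# Terms whose correlation with "cat" is at least 0.5
findAssocs(tdm, "cat", 0.5)
```

With one document per row, findAssocs has several columns to correlate across, which is what the question was missing.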