How to build a hashtag corpus (text mining)

Question

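I am trying to analyze Twitter data by mining all the hashtags. I want to put all the hashtags into a corpus and map that corpus to a list of words. Do you have any idea how I can solve this? Here is my data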

This is the code I used, but my DTM comes out 100% sparse:

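# Split each tweet on "#"; the first piece is the text before the first hashtag
step1 <- strsplit(newFile$Hashtag, "#")
# Drop that first piece
step2 <- lapply(step1, tail, -1)
# Keep only the first word of each remaining piece (the hashtag itself)
result <- lapply(step2, function(x){
  sapply(strsplit(x, " "), head, 1)
})
# Flatten into one vector: every hashtag becomes a separate element
result2 <- do.call(c, unlist(result, recursive=FALSE))
myCorpus <- tm::Corpus(tm::VectorSource(result2)) # create a corpus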

Here is the information about my corpus:

myCorpus
  <<SimpleCorpus>>
 Metadata:  corpus specific: 1, document level (indexed): 0
 Content:  documents: 12635

And my DTM:

<<DocumentTermMatrix (documents: 12635, terms: 6280)>>
Non-/sparse entries: 12285/79335515
Sparsity           : 100%
Maximal term length: 36
Weighting          : term frequency (tf)

Answer 1

Your problem is that you are using strsplit. You should try str_extract_all from the stringr package instead:

str_extract_all("This all are hashtag #hello #I #am #a #buch #of #hashtags", "#\\S+")

This returns the following list:
[[1]]
[1] "#hello"    "#I"        "#am"       "#a"        "#buch"     "#of"      
[7] "#hashtags"

If you want the result as a character matrix instead of a list, use simplify = T:

str_extract_all("This all are hashtag #hello #I #am #a #buch #of #hashtags", "#\\S+", simplify = T)

Result:

     [,1]     [,2] [,3]  [,4] [,5]    [,6]  [,7]       
[1,] "#hello" "#I" "#am" "#a" "#buch" "#of" "#hashtags"
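To connect this back to the corpus in the question, here is a minimal sketch of one way to feed the extracted hashtags into tm. It keeps one document per tweet rather than one document per hashtag. The `tweets` vector is a hypothetical stand-in for `newFile$Hashtag`, and stripping the leading "#" and setting `wordLengths` are my assumptions about what the resulting terms should look like, not something stated in the question.

```r
library(stringr)
library(tm)

# Hypothetical stand-in for newFile$Hashtag: one string per tweet
tweets <- c("Loving the #rstats community #datascience",
            "no tags here",
            "#RStats makes #textmining easy")

# One character vector of hashtags per tweet (empty for tweets with none)
tags <- str_extract_all(tweets, "#\\S+")

# Drop the leading "#" and rejoin, so each tweet stays a single document
docs <- vapply(tags,
               function(x) paste(sub("^#", "", x), collapse = " "),
               character(1))

# Keep only tweets that contain at least one hashtag
docs <- docs[nzchar(docs)]

myCorpus <- Corpus(VectorSource(docs))

# Assumption: short tags such as "#I" should be kept; by default tm
# drops terms with fewer than 3 characters
dtm <- DocumentTermMatrix(myCorpus,
                          control = list(wordLengths = c(1, Inf)))
inspect(dtm)
```

Note that a document-term matrix built from short documents will still report a sparsity close to 100%, because each tweet uses only a handful of the available terms. That figure alone does not necessarily mean the corpus was built incorrectly.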

Solution 1
0 Accepted 2017-12-20 11:34:18
