
Topic modelling in R using phrases rather than single words

I'm trying to do some topic modelling, but I want to use phrases where they exist rather than single words, i.e.:

library(topicmodels)
library(tm)
my.docs = c('the sky is blue, hot sun', 'flowers,hot sun', 'black cats, bees, rats and mice')
my.corpus = Corpus(VectorSource(my.docs))
my.dtm = DocumentTermMatrix(my.corpus)
inspect(my.dtm)

When I inspect my DTM it splits all the words up, but I want the phrases kept together, i.e. there should be a column for each of: "the sky is blue", "hot sun", "flowers", "black cats", "bees", "rats and mice".

How do I make the DocumentTermMatrix recognise phrases as well as single words? They are comma separated.

The solution needs to be efficient, as I want to run it over a lot of data.

You might try an approach using a custom tokenizer. You define the multi-word terms you want as phrases (I am not aware of an algorithmic approach to do that step):

tokenizing.phrases <- c("sky is blue", "hot sun", "black cats")

Note that no stemming is done, so if you want both "black cats" and "black cat", then you will need to enter both variations. Case is ignored.

Then you need to create a function:

phraseTokenizer <- function(x) {
  require(stringr)
  require(tm)  # for MC_tokenizer()

  x <- as.character(x)  # extract the plain text from the tm TextDocument object
  x <- str_trim(x)
  if (is.na(x)) return("")

  # which of the predefined phrases occur in this text?
  # (fixed(..., ignore_case = TRUE) replaces the ignore.case() helper removed from newer stringr)
  phrase.hits <- str_detect(x, fixed(tokenizing.phrases, ignore_case = TRUE))

  if (any(phrase.hits)) {
    # only split once, on the first hit, so you don't have to worry about
    # multiple occurrences of the same phrase
    split.phrase <- tokenizing.phrases[which(phrase.hits)[1]]
    temp <- unlist(str_split(x, fixed(split.phrase, ignore_case = TRUE), 2))
    # keep the phrase as a single token and recurse on the text before and after it
    out <- c(phraseTokenizer(temp[1]), split.phrase, phraseTokenizer(temp[2]))
  } else {
    # no phrase found: fall back to tm's standard word tokenizer
    out <- MC_tokenizer(x)
  }

  out[out != ""]  # drop empty tokens
}
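
As a quick sanity check (using the tokenizing.phrases vector defined above), calling the tokenizer on the first document from the question should return each phrase as a single token, roughly:

phraseTokenizer("the sky is blue, hot sun")
# expected (roughly): "the"  "sky is blue"  "hot sun"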

Then you proceed as normal to create a term-document matrix, but this time you pass the phrase-aware tokenizer via the control argument.

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = phraseTokenizer)) 
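
Putting it together with the corpus from the question, a minimal end-to-end sketch might look like this (note: with recent versions of tm, Corpus() returns a SimpleCorpus, which ignores custom tokenizers, so VCorpus() is used here; whether you build a DocumentTermMatrix or a TermDocumentMatrix only changes the orientation of the result):

library(tm)
library(stringr)

tokenizing.phrases <- c("sky is blue", "hot sun", "black cats")

my.docs <- c('the sky is blue, hot sun', 'flowers,hot sun', 'black cats, bees, rats and mice')
my.corpus <- VCorpus(VectorSource(my.docs))  # VCorpus so the custom tokenizer is honoured

my.dtm <- DocumentTermMatrix(my.corpus, control = list(tokenize = phraseTokenizer))
inspect(my.dtm)
# the terms should now include "sky is blue", "hot sun" and "black cats"
# as single columns, alongside the remaining single words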

Maybe have a look at this relatively recent publication on that topic:

http://web.engr.illinois.edu/~hanj/pdf/kdd13_cwang.pdf

They give an algorithm for identifying phrases and partitioning/tokenizing a document into those phrases.
