Topic modeling in R using phrases rather than single words

Question

I'm trying to do some topic modeling, but I want to use phrases where they exist rather than single words, i.e.

library(topicmodels)
library(tm)
my.docs = c('the sky is blue, hot sun', 'flowers,hot sun', 'black cats, bees, rats and mice')
my.corpus = Corpus(VectorSource(my.docs))
my.dtm = DocumentTermMatrix(my.corpus)
inspect(my.dtm)

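When I inspect the dtm, it splits everything into single words, but I want the phrases kept together, i.e. there should be one column for each of: the sky is blue, hot sun, flowers, black cats, bees, rats and mice

How do I make the document-term matrix recognize phrases as well as words? They are comma-separated.

The solution needs to be efficient, as I want to run it over a lot of data.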

Answer 1

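You could try an approach using a custom tokenizer. You define the multi-word terms you want as phrases (I'm not aware of algorithmic code to do that step):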

tokenizing.phrases <- c("sky is blue", "hot sun", "black cats")

Note that no stemming is done, so if you want both "black cats" and "black cat", you will need to enter both variations. Case is ignored.
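For data like the sample in the question, where phrases are delimited by commas, one possible way to build that vector is to split on commas and keep the multi-word chunks. This is only a sketch under that assumption: `candidate.phrases` is a hypothetical name, and longer phrases are sorted first because the tokenizer tries phrases in vector order.

```r
# Assumption: phrases are exactly the comma-separated chunks of each document,
# as in the question's sample data.
my.docs <- c("the sky is blue, hot sun", "flowers,hot sun",
             "black cats, bees, rats and mice")

# Split every document on commas and trim surrounding whitespace.
chunks <- trimws(unlist(strsplit(my.docs, ",", fixed = TRUE)))

# Keep only the unique multi-word chunks; single words need no special handling.
candidate.phrases <- unique(tolower(chunks[grepl(" ", chunks, fixed = TRUE)]))

# Put longer phrases first so a long phrase is tried before a shorter one it contains.
candidate.phrases <- candidate.phrases[order(-nchar(candidate.phrases))]
print(candidate.phrases)
```

The result could be assigned to `tokenizing.phrases` in place of the hand-written vector.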

Then you need to create a function:

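    library(stringr)  # str_trim(), str_detect(), str_split(), regex()
    library(tm)       # MC_tokenizer()

    phraseTokenizer <- function(x) {
      # extract the plain text from the tm TextDocument object
      x <- as.character(x)
      # join multi-line documents into one string, dropping NA lines
      x <- str_trim(paste(x[!is.na(x)], collapse = " "))
      if (x == "") return(character(0))
      # ignore.case() is gone from current stringr; regex(ignore_case = TRUE) replaces it
      phrase.hits <- str_detect(x, regex(tokenizing.phrases, ignore_case = TRUE))

      if (any(phrase.hits)) {
        # only split once on the first hit, so you don't have to worry about
        # multiple occurrences of the same phrase
        split.phrase <- tokenizing.phrases[which(phrase.hits)[1]]
        temp <- unlist(str_split(x, regex(split.phrase, ignore_case = TRUE), 2))
        # tokenize the text on either side of the phrase recursively
        out <- c(phraseTokenizer(temp[1]), split.phrase, phraseTokenizer(temp[2]))
      } else {
        # no phrase found: fall back to tm's word tokenizer
        out <- MC_tokenizer(x)
      }

      # drop empty tokens
      out[out != ""]
    }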

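Then you can proceed as usual to create a term-document matrix, but this time you pass the phrase tokenizer through the control argument so that the phrases are kept as terms.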

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = phraseTokenizer))
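If, as in the question, every phrase is already delimited by commas, a simpler tokenizer that just splits on commas may be enough and avoids the recursive matching. The sketch below is one way to do that; `comma_tokenizer` is a hypothetical name. It uses `VCorpus` on the assumption that recent tm versions build a `SimpleCorpus` from `Corpus(VectorSource(...))` and ignore custom tokenizers for it. The tm part only runs if the package is installed.

```r
# Hypothetical helper: treat each comma-separated chunk as one term.
comma_tokenizer <- function(x) {
  # as.character() extracts the text from a tm document; join multi-line text first
  txt <- paste(as.character(x), collapse = " ")
  parts <- trimws(unlist(strsplit(txt, ",", fixed = TRUE)))
  parts[nzchar(parts)]  # drop empty chunks
}

my.docs <- c("the sky is blue, hot sun", "flowers,hot sun",
             "black cats, bees, rats and mice")
print(comma_tokenizer(my.docs[3]))

if (requireNamespace("tm", quietly = TRUE)) {
  # Assumption: VCorpus is needed for the custom tokenizer to be honoured.
  corpus <- tm::VCorpus(tm::VectorSource(my.docs))
  dtm <- tm::DocumentTermMatrix(corpus, control = list(tokenize = comma_tokenizer))
  print(tm::Terms(dtm))
}
```

Because it is a single `strsplit` per document, this variant should also scale better than matching every phrase against every document, though that has not been benchmarked here.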

Answer 2

Maybe take a look at this recent publication on the topic:

http://web.engr.illinois.edu/~hanj/pdf/kdd13_cwang.pdf

They provide an algorithm for identifying phrases and partitioning/tokenizing documents into those phrases.
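To give a feel for what automatic phrase identification means, here is a much simpler frequency heuristic in base R. It is not the algorithm from the linked paper, only a naive stand-in: it counts adjacent word pairs and keeps those that occur at least `min.count` times. The sample documents, `bigrams_of`, and `min.count` are all made up for illustration.

```r
# Naive stand-in, NOT the algorithm from the linked paper:
# count adjacent word pairs and keep those seen at least `min.count` times.
docs <- c("the hot sun is out",
          "flowers love the hot sun",
          "black cats avoid the hot sun")
min.count <- 2  # hypothetical threshold

# Return all adjacent word pairs of one document as "word1 word2" strings.
bigrams_of <- function(doc) {
  words <- unlist(strsplit(tolower(doc), "[^a-z]+"))
  words <- words[nzchar(words)]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# Tabulate the pairs over all documents and filter by the threshold.
counts <- table(unlist(lapply(docs, bigrams_of)))
frequent.phrases <- names(counts)[counts >= min.count]
print(frequent.phrases)
```

Pairs such as "the hot" show that raw frequency alone also picks up uninteresting combinations, which is part of why a dedicated phrase-mining method can be worth the effort.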

Answer 1
5 2015-02-02 12:47:54

Answer 2
0 2015-02-03 21:13:12
