stemDocument works in TermDocumentMatrix but not in tm_map (tm package, R)

Question

Suppose I have the string "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12". My code is:

> a1 <- VCorpus(VectorSource("COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"))
> a3 <- TermDocumentMatrix(a1,control = list(stemming=T))

The resulting matrix is:

           Docs
Terms       1
  assort    1
  club      1
  color     2
  nori      1
  pencil    1
  pkt12     1
  staedtler 1

So we can see that stemDocument works for COLORED and COLORS: both become color. However, if I do this:

> a1 <- VCorpus(VectorSource("COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"))
> a2 <- a1 %>% tm_map(PlainTextDocument) %>% tm_map(stemDocument,"english")
> a2[[1]]$content
[1] "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"
> a2 <- a2 %>% TermDocumentMatrix()

The resulting matrix is:

           Docs
Terms       character(0)
  assorted             1
  club                 1
  colored              1
  colors               1
  noris                1
  pencil               1
  pkt12                1
  staedtler            1

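We can see that stemDocument does not work here. I also noticed a "character(0)" in the column header, which does not appear in the first matrix above, and I don't know why.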

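In my case I need to preprocess the text data (stopword removal, stemDocument, and so on) and then save the processed text to a csv file. So I can't simply use TermDocumentMatrix to generate the matrix directly. Can anyone help me? Thanks a lot.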

Answer 1

This should help you achieve what you want. Before creating the dtm/tdm, I usually convert all text to lowercase, remove punctuation, and so on.

library(tm)
library(magrittr) ## provides the %>% pipe used below
txt <- "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"

txt <- tolower(txt) ## this is the extra step where I have converted everything to lower case

a1 <- VCorpus(VectorSource(txt))
a2 <- a1 %>%  tm_map(stemDocument) 
a2 <- a2 %>% TermDocumentMatrix()
inspect(a2)
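
Since the question asks for the processed text itself in a csv file rather than the matrix, here is a minimal sketch of one way to do that, assuming the tm and SnowballC packages are installed. The names processed, out and out_file, the column names, and the output location are hypothetical choices for illustration, not something from the original post.

```r
library(tm)

txt <- "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"

# Lower-case first, as in the answer above, then build the corpus
corp <- VCorpus(VectorSource(tolower(txt)))

# Preprocessing steps mentioned in the question
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, stemDocument)
corp <- tm_map(corp, stripWhitespace)

# Pull the processed text back out of the corpus, one string per document
processed <- unname(sapply(corp, function(doc) {
  paste(as.character(doc), collapse = " ")
}))

# Hypothetical output layout: one row per document
out <- data.frame(doc_id = seq_along(processed),
                  text = processed,
                  stringsAsFactors = FALSE)

# Hypothetical output path; replace with wherever the file should go
out_file <- file.path(tempdir(), "processed_text.csv")
write.csv(out, out_file, row.names = FALSE)
```

The corpus stays the unit of work throughout, so further tm_map steps can be added before the text is extracted.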

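The character(0) appears because of the call to PlainTextDocument(). In situations where you would otherwise need it (for example, when you pass tolower to tm_map and get the error: Error: inherits(doc, "TextDocument") is not TRUE), use content_transformer instead.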
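
As a rough illustration of the content_transformer route, the sketch below lower-cases inside the corpus instead of on the raw string. It assumes tm and SnowballC are installed; the variable names corp, stemmed and tdm are made up for this example.

```r
library(tm)

corp <- VCorpus(VectorSource(
  "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"
))

# content_transformer() wraps a plain character function such as tolower()
# so that tm_map() gets a text document back instead of a bare string
corp <- tm_map(corp, content_transformer(tolower))

# Stem after lower-casing
corp <- tm_map(corp, stemDocument)

# The stemmed text of the first document
stemmed <- as.character(corp[[1]])
print(stemmed)

# The document keeps its metadata, so the matrix has a normal column header
tdm <- TermDocumentMatrix(corp)
inspect(tdm)
```

Doing the lower-casing with content_transformer keeps every step inside tm_map, which is convenient when the corpus is built from files rather than from a string you can modify beforehand.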

Hope this helps.

Solution 1
1 accepted 2016-11-07 07:06:33
