stemDocument works in TermDocumentMatrix but not in tm_map (tm package, R)

Question

Let's say there is a string "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12". My code is:

> a1 <- VCorpus(VectorSource("COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"))
> a3 <- TermDocumentMatrix(a1,control = list(stemming=T))

The matrix is:

           Docs
Terms       1
  assort    1
  club      1
  color     2
  nori      1
  pencil    1
  pkt12     1
  staedtler 1

So we can see that stemDocument works for colored and colors: both are reduced to color. However, if I do:

> a1 <- VCorpus(VectorSource("COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"))
> a2 <- a1 %>% tm_map(PlainTextDocument) %>% tm_map(stemDocument,"english")
> a2[[1]]$content
[1] "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"
> a2 <- a2 %>% TermDocumentMatrix()

The matrix is:

           Docs
Terms       character(0)
  assorted             1
  club                 1
  colored              1
  colors               1
  noris                1
  pencil               1
  pkt12                1
  staedtler            1

We can see that stemDocument does not work here. I also notice a "character(0)" column header here that does not appear in the matrix above, but I do not know why.

In my situation, I need to pre-process the text data (stop-word removal, stemDocument and so on) and then save the processed text to a CSV file, so I cannot simply use TermDocumentMatrix to generate the matrix. Could anyone help me out here? Thanks a lot.

Answer 1

This should help you achieve what you want. I usually convert all the text to lower case, remove punctuation marks and so on before creating the DTM/TDM.

library(tm)
library(magrittr)  # provides the %>% pipe used below

txt <- "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"

## this is the extra step: convert everything to lower case before stemming
txt <- tolower(txt)

a1 <- VCorpus(VectorSource(txt))
a2 <- a1 %>% tm_map(stemDocument)   # stem the lower-cased text
a2 <- a2 %>% TermDocumentMatrix()
inspect(a2)

character(0) appears because of the call to PlainTextDocument(). If you only added that call because passing tolower directly to tm_map gave the error - Error: inherits(doc, "TextDocument") is not TRUE - wrap the function in content_transformer instead.
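As a rough sketch of how this could fit the CSV requirement from the question: the snippet below lower-cases the text inside tm_map via content_transformer, removes stop words, stems, and then pulls the processed text back out of the corpus. The names processed and out and the file name processed_text.csv are made up for illustration, and stemDocument assumes the SnowballC package is installed.

```r
library(tm)  # stemDocument relies on the SnowballC package being installed

txt <- "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"

corpus <- VCorpus(VectorSource(txt))

# content_transformer() wraps a plain character function (here tolower)
# so that tm_map keeps returning TextDocument objects
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)

# Extract the processed text of each document as one string
processed <- vapply(
  corpus,
  function(doc) paste(content(doc), collapse = " "),
  character(1)
)

# Hypothetical output location; adjust the path and column names as needed
out <- data.frame(doc_id = seq_along(processed),
                  text = unname(processed),
                  stringsAsFactors = FALSE)
write.csv(out, file.path(tempdir(), "processed_text.csv"), row.names = FALSE)
```

Because the text is lower-cased before stemDocument runs, colored and colors should both come out as color in the saved text, matching the first matrix in the question.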

Hope this helps. 希望这可以帮助。
