stemDocument works in TermDocumentMatrix but does not work in tm_map using tm and R

Let's say there is a string "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12". My code is:

> library(tm)
> a1 <- VCorpus(VectorSource("COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"))
> a3 <- TermDocumentMatrix(a1, control = list(stemming = TRUE))

The matrix is:

           Docs
Terms       1
  assort    1
  club      1
  color     2
  nori      1
  pencil    1
  pkt12     1
  staedtler 1

So we can see that stemming works for "colored" and "colors", both of which are reduced to "color". However, if I do:

> a1 <- VCorpus(VectorSource("COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"))
> a2 <- a1 %>% tm_map(PlainTextDocument) %>% tm_map(stemDocument,"english")
> a2[[1]]$content
[1] "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"
> a2 <- a2 %>% TermDocumentMatrix()

The matrix is:

           Docs
Terms       character(0)
  assorted             1
  club                 1
  colored              1
  colors               1
  noris                1
  pencil               1
  pkt12                1
  staedtler            1

We can see that stemDocument does not work here. I also notice a "character(0)" column header that does not appear in the first matrix, but I do not know why.

My situation is that I need to pre-process the text data (stop word removal, stemDocument and so on) and then save the processed text to a csv file. So I cannot simply use TermDocumentMatrix to generate the matrix directly. Could anyone help me out here? Thanks a lot.

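This should help you achieve what you want. TermDocumentMatrix lowercases the text by default before it stems, whereas tm_map(stemDocument) receives the uppercase text as it is, and the stemmer leaves uppercase words unchanged. I usually convert all the text to lower case, remove punctuation marks and so on before creating the dtm/tdm.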

library(tm)        # stemDocument() also needs the SnowballC package installed
library(magrittr)  # provides the %>% pipe

txt <- "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"

txt <- tolower(txt)  ## this is the extra step: convert everything to lower case before stemming

a1 <- VCorpus(VectorSource(txt))
a2 <- a1 %>% tm_map(stemDocument)
a2 <- a2 %>% TermDocumentMatrix()
inspect(a2)
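
Since the question asks for the processed text in a csv file rather than a matrix, here is a minimal sketch of that step. It assumes one row per document is wanted; the column names doc_id and text and the output file name are illustrative choices, not anything tm prescribes.

```r
library(tm)  # stemDocument() also needs the SnowballC package installed

txt <- c("COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12")

# Lower-case first, then stem, as in the code above.
corpus <- VCorpus(VectorSource(tolower(txt)))
corpus <- tm_map(corpus, stemDocument)

# Pull the processed text back out, one string per document.
processed <- vapply(
  corpus,
  function(doc) paste(content(doc), collapse = " "),
  character(1)
)

# Hypothetical layout: a running document number plus the processed text.
out <- data.frame(
  doc_id = seq_along(processed),
  text = unname(processed),
  stringsAsFactors = FALSE
)

# Written to a temporary directory here; point this at the real path as needed.
out_file <- file.path(tempdir(), "processed_text.csv")
write.csv(out, out_file, row.names = FALSE)
```

Other tm_map steps such as removeWords or removePunctuation can be added before the extraction; the csv part stays the same.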

character(0) appears because of the call to PlainTextDocument(). If you were only using it to get around an error such as Error: inherits(doc, "TextDocument") is not TRUE, which you get when you pass tolower directly to tm_map, wrap the function in content_transformer instead.
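
As a sketch of the content_transformer route, the lower-casing can be done inside the tm_map chain without PlainTextDocument(). The variable names follow the question; tdm is just an illustrative name.

```r
library(tm)  # stemDocument() also needs the SnowballC package installed

a1 <- VCorpus(VectorSource("COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"))

# content_transformer() wraps a plain character function so that tm_map()
# still gets a TextDocument back and the document metadata (including the ID) is kept.
a2 <- tm_map(a1, content_transformer(tolower))
a2 <- tm_map(a2, stemDocument)

content(a2[[1]])     # the lower-cased, stemmed text
meta(a2[[1]], "id")  # the document ID is still set

tdm <- TermDocumentMatrix(a2)
inspect(tdm)
```

Because the document ID survives, the matrix column should be labelled with that ID rather than character(0).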

Hope this helps.
