stemDocument works in TermDocumentMatrix but does not work in tm_map using tm and R
Let's say there is a string "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12". My code is:
> a1 <- VCorpus(VectorSource("COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"))
> a3 <- TermDocumentMatrix(a1,control = list(stemming=T))
The matrix is:
Docs
Terms 1
assort 1
club 1
color 2
nori 1
pencil 1
pkt12 1
staedtler 1
So we can see that stemDocument works for "colored" and "colors", both of which are reduced to "color". However, if I do:
> a1 <- VCorpus(VectorSource("COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"))
> a2 <- a1 %>% tm_map(PlainTextDocument) %>% tm_map(stemDocument,"english")
> a2[[1]]$content
[1] "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"
> a2 <- a2 %>% TermDocumentMatrix()
The matrix is:
Docs
Terms character(0)
assorted 1
club 1
colored 1
colors 1
noris 1
pencil 1
pkt12 1
staedtler 1
We can see that stemDocument does not work here. I also notice that "character(0)" appears here, which is not shown in the matrix above, but I do not know why.
My situation is that I need to do some pre-processing on the text data, such as removing stop words, stemDocument and so on. Then I need to save the processed text to a csv file, so I cannot use TermDocumentMatrix directly to generate the matrix. Could anyone help me out here? Thanks a lot.
This should help you achieve what you want. I usually convert all the text to lower case, remove punctuation marks, etc., before creating the dtm/tdm.
library(tm)
library(magrittr) ## provides the %>% pipe used below
txt <- "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"
txt <- tolower(txt) ## this is the extra step where I have converted everything to lower case
a1 <- VCorpus(VectorSource(txt))
a2 <- a1 %>% tm_map(stemDocument)
a2 <- a2 %>% TermDocumentMatrix()
inspect(a2)
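To see why the tolower step matters, it may help to call the stemmer directly. The sketch below assumes the SnowballC package is installed (tm relies on it for stemming); the variable names are only illustrative. It compares the stems of the upper-case and lower-case forms of the two words from the question.

```r
library(SnowballC)

# Same two words, once in upper case and once in lower case
words <- c("COLORED", "COLORS", "colored", "colors")

# Stem each word with the English Snowball stemmer
stems <- wordStem(words, language = "english")
print(stems)
```

The lower-case forms are reduced to "color", while the upper-case forms come back unstemmed, which matches the unchanged a2[[1]]$content in the question. TermDocumentMatrix lower-cases terms by default before stemming, which is likely why stemming = T appeared to work there.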
character(0) appears because of calling PlainTextDocument(). In cases where it seems necessary to use it, for example when you pass tolower to tm_map and get this error:
Error: inherits(doc, "TextDocument") is not TRUE
use content_transformer instead.
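Since the goal is to save the processed text to a csv file, the whole thing can be sketched as follows. This is a minimal sketch, assuming a recent version of tm with SnowballC installed; the output file name and the column name text are placeholders of mine, not something from the question.

```r
library(tm)  # stemDocument also needs the SnowballC package installed

txt <- "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"
corpus <- VCorpus(VectorSource(txt))

# Wrap tolower in content_transformer so the documents stay TextDocuments
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stemDocument)

# Pull the processed text back out of the corpus as a character vector
processed <- sapply(corpus, as.character)

# Hypothetical output location; replace with your own path
out_file <- file.path(tempdir(), "processed_text.csv")
write.csv(data.frame(text = processed, stringsAsFactors = FALSE),
          file = out_file, row.names = FALSE)
```

Other steps such as removeWords with stopwords("english") or removePunctuation can be added as further tm_map calls before stemDocument.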
Hope this helps.