R - Tokenization - single and two letter words in a TermDocumentMatrix
I am currently doing a little text processing and I would like to get the one- and two-letter words in a TermDocumentMatrix.
The issue is that it seems to display only words of three letters or more.
library(tm)
library(RWeka)
test<-'This is a test.'
testmyCorpus<-Corpus(VectorSource(test))
testTDF<-TermDocumentMatrix(testmyCorpus, control=list(tokenize=AlphabeticTokenizer))
inspect(testTDF)
Only the words "this" and "test" are displayed. Any ideas?
Thanks a lot for your help! Robert
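There is an existing answer that almost solves your problem. In short, tm drops terms shorter than three characters by default, so you should add wordLengths=c(1,Inf) to the control list of your TermDocumentMatrix call, i.e. control=list(wordLengths=c(1,Inf)).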
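As a rough illustration of that suggestion, here is a minimal sketch that lowers the minimum term length to one character. It makes a few assumptions that are not in the original post: it uses VCorpus rather than Corpus, and it replaces the RWeka AlphabeticTokenizer with tm's own removePunctuation option so that it runs without Java. Keeping tokenize=AlphabeticTokenizer in the same control list alongside wordLengths should presumably work as well.

```r
library(tm)

test <- "This is a test."

# VCorpus is an assumption here; the question used Corpus()
corpus <- VCorpus(VectorSource(test))

# wordLengths = c(1, Inf) keeps terms of any length (the default lower bound is 3).
# removePunctuation stands in for the AlphabeticTokenizer used in the question.
tdm <- TermDocumentMatrix(
  corpus,
  control = list(removePunctuation = TRUE, wordLengths = c(1, Inf))
)

# Collect the terms so they can be checked programmatically
terms <- Terms(tdm)
inspect(tdm)
```

With this control list, the one- and two-letter words "a" and "is" should appear among the terms next to "this" and "test".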