R - Tokenization - single and two letter words in a TermDocumentMatrix

I am currently doing some text processing and would like to get the one- and two-letter words in a TermDocumentMatrix.

The issue is that it seems to display only words of three letters or more.

    library(tm)
    library(RWeka)

    test <- 'This is a test.'

    # Build a one-document corpus and a term-document matrix from it
    testmyCorpus <- Corpus(VectorSource(test))
    testTDF <- TermDocumentMatrix(testmyCorpus, control = list(tokenize = AlphabeticTokenizer))
    inspect(testTDF)

Only the words "this" and "test" are displayed. Any ideas?

Thanks a lot for your help! Robert

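An existing answer almost solves your problem. In short: by default tm only keeps terms of at least three characters (wordLengths=c(3,Inf)), so you should add wordLengths=c(1,Inf) to the control list, i.e. control=list(wordLengths=c(1,Inf)).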
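
As a minimal sketch of that suggestion, the example below builds the matrix with the lower word-length bound set to 1. It makes two assumptions that are not part of the original question: it uses VCorpus instead of Corpus, and it replaces the RWeka AlphabeticTokenizer with tm's default tokenizer plus removePunctuation=TRUE, so that it runs without Java. The original tokenizer could presumably be kept by adding tokenize=AlphabeticTokenizer to the same control list.

```r
library(tm)

test <- "This is a test."

# Assumption: VCorpus is used here in place of the question's Corpus() call
testmyCorpus <- VCorpus(VectorSource(test))

testTDF <- TermDocumentMatrix(
  testmyCorpus,
  control = list(
    # Assumption: strip the trailing "." instead of using RWeka's tokenizer
    removePunctuation = TRUE,
    # Keep terms of any length from 1 character upwards (tm's default is c(3, Inf))
    wordLengths = c(1, Inf)
  )
)

inspect(testTDF)

# Terms() lists the row names of the matrix, i.e. the words that were kept
print(Terms(testTDF))
```

With the lower bound at 1, the short words "a" and "is" should appear alongside "this" and "test".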
