R - tokenization - single and two letter words in a TermDocumentMatrix

Question

I am currently trying to do a little bit of text processing and I would like to get the one- and two-letter words in a TermDocumentMatrix.

The issue is that it seems to display only words of three letters or more.

    library(tm)
    library(RWeka)

    test<-'This is a test.'

    testmyCorpus<-Corpus(VectorSource(test))
    testTDF<-TermDocumentMatrix(testmyCorpus, control=list(tokenize=AlphabeticTokenizer))
    inspect(testTDF)

Only the words "this" and "test" are displayed. Any ideas?

Thanks a lot for your help! Robert

Answer 1

There is an existing answer that almost solves your problem. In short, you should add wordLengths=c(1,Inf) to the control list, i.e. control=list(wordLengths=c(1,Inf)).
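A minimal sketch of that suggestion follows. In tm, the wordLengths control option defaults to c(3, Inf), which is why terms shorter than three characters are dropped. Note the assumptions here, none of which come from the question: VCorpus replaces Corpus, tm's default tokenizer replaces RWeka's AlphabeticTokenizer (to avoid the Java dependency), and removePunctuation is switched on so that "test." is counted as "test".

```r
library(tm)

test <- "This is a test."

# Assumption: VCorpus instead of the asker's Corpus() call
testCorpus <- VCorpus(VectorSource(test))

testTDF <- TermDocumentMatrix(
  testCorpus,
  control = list(
    removePunctuation = TRUE,  # assumption: strip the trailing "." from "test."
    wordLengths = c(1, Inf)    # keep terms of any length; the default is c(3, Inf)
  )
)

inspect(testTDF)

# List the terms that survived the length filter
Terms(testTDF)
```

If the RWeka tokenizer is still needed, the same wordLengths entry should be addable next to tokenize=AlphabeticTokenizer in the control list, though that combination has not been checked here.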

2 Accepted 2015-02-24 19:22:50
