
R - How to apply terms from training document-term-matrix (dtm) to test dtm (both unigrams and bigrams)?

I am training a simple text classification method on 1,000 training examples and would like to make predictions on unseen test data (about 500,000 observations).

The script works fine when I use only unigrams. However, I am not sure how to use control = list(dictionary=Terms(dtm_train_unigram)) when working with both unigrams and bigrams, as I have two separate document-term matrices (one for unigrams, one for bigrams, see below):

  # Unigram tokenizer and training DTM
  UnigramTokenizer <- function(x)
    unlist(lapply(NLP::ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE)
  dtm_train_unigram <- DocumentTermMatrix(processed_dataset,
    control = list(tokenize = UnigramTokenizer,
                   wordLengths = c(3, 20),
                   bounds = list(global = c(4, Inf))))

  # Bigram tokenizer and training DTM
  BigramTokenizer <- function(x)
    unlist(lapply(NLP::ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
  dtm_train_bigram <- DocumentTermMatrix(processed_dataset,
    control = list(tokenize = BigramTokenizer,
                   wordLengths = c(6, 20),
                   bounds = list(global = c(7, Inf))))

To ensure that the test set has the same terms as the training set, I use the following function:

  corpus_test <- VCorpus(VectorSource(test_set))
  dtm_test <- DocumentTermMatrix(corpus_test,
    control = list(dictionary = Terms(dtm_train_unigram),
                   wordLengths = c(3, 20)))

How do I feed the terms of both dtm_train_unigram and dtm_train_bigram into dtm_test?

  1. Can I combine dtm_train_unigram and dtm_train_bigram to a single dtm after creating them separately (as currently done)?
  2. Can I simplify my two-step Tokenizer function so I only create a single dtm with unigrams and bigrams in the first place?

Thank you!

Answering your questions:

The official documentation of tm states the following about combining objects with c():

Combine several corpora into a single one, combine multiple documents into a corpus, combine multiple term-document matrices into a single one, or combine multiple term frequency vectors into a single term-document matrix.

Applied to your case, that suggests the following for question 1:

my_dtms <- c(dtm_train_unigram, dtm_train_bigram)

However, concatenating the matrices this way doubles the number of documents, even though both matrices describe the same underlying documents, so it is not what you want.
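That said, you do not actually need to merge the matrices to answer question 1: the dictionary argument only needs a character vector of terms, so you can pass it the union of both training vocabularies. A rough, untested sketch using the objects from your question (note the test DTM must also be built with a tokenizer that emits bigrams, otherwise the bigram terms in the dictionary can never match):

```r
library(tm)
library(NLP)

# Tokenizer emitting both unigrams and bigrams, so bigram dictionary
# entries can actually be matched in the test corpus.
UniBigramTokenizer <- function(x)
  unlist(lapply(NLP::ngrams(words(x), 1:2), paste, collapse = " "), use.names = FALSE)

# Union of the terms of both training matrices.
combined_terms <- c(Terms(dtm_train_unigram), Terms(dtm_train_bigram))

# Test DTM restricted to the combined training vocabulary.
dtm_test <- DocumentTermMatrix(
  corpus_test,
  control = list(tokenize   = UniBigramTokenizer,
                 dictionary = combined_terms)
)
```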

So we come to point 2: with the NLP package you can create a tokenizer that handles more than one n-gram length at once:

my_tokenizer <- function(x) unlist(lapply(NLP::ngrams(words(x), 1:2), paste, collapse = " "), use.names = FALSE)

Note the vector 1:2 passed to the ngrams function. Change it to 1:3 for uni-, bi-, and trigrams, or to 2:3 for only bi- and trigrams.
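Putting it together, an untested end-to-end sketch (reusing processed_dataset, test_set, and the bounds from your question; the bounds are illustrative and you may want different ones for a mixed uni/bigram matrix): build a single training DTM with the combined tokenizer, then reuse its terms as the dictionary for the test DTM.

```r
library(tm)
library(NLP)

# Tokenizer producing both unigrams and bigrams in one pass.
my_tokenizer <- function(x)
  unlist(lapply(NLP::ngrams(words(x), 1:2), paste, collapse = " "), use.names = FALSE)

# One training DTM containing both unigrams and bigrams.
dtm_train <- DocumentTermMatrix(
  processed_dataset,
  control = list(tokenize    = my_tokenizer,
                 wordLengths = c(3, 20),
                 bounds      = list(global = c(4, Inf)))
)

# Test DTM with exactly the training vocabulary: same tokenizer,
# training terms as the dictionary.
corpus_test <- VCorpus(VectorSource(test_set))
dtm_test <- DocumentTermMatrix(
  corpus_test,
  control = list(tokenize   = my_tokenizer,
                 dictionary = Terms(dtm_train))
)
```

With this, dtm_test has one column per training term, so a classifier fit on dtm_train can predict on dtm_test directly.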
