Here's my situation: the solution seemed simple at first, but it has turned out to be more complicated than I expected.
I have an R data frame with three columns: an ID, a column of texts (reviews), and a column of numeric values that I want to predict based on the text.
I have already done some preprocessing on the text column, so it is free of punctuation, in lower case, and ready to be tokenized and turned into a matrix so I can train a model on it. The problem is I can't figure out how to remove the stop words from that text.
Here's what I am trying to do with the text2vec package. I originally planned to do the stop-word removal before this chunk, but anywhere in the pipeline would do.
library(text2vec)

test_data <- data.frame(review_id = c(1, 2, 3),
                        review = c('is a masterpiece a work of art',
                                   'sporting some of the best writing and voice work',
                                   'better in every possible way when compared'),
                        score = c(90, 100, 100))

# tokenize, build a hashed document-term matrix, then apply TF-IDF weighting
tokens <- word_tokenizer(test_data$review)
document_term_matrix <- create_dtm(itoken(tokens), hash_vectorizer())
model_tfidf <- TfIdf$new()
document_term_matrix <- model_tfidf$fit_transform(document_term_matrix)
document_term_matrix <- as.matrix(document_term_matrix)
I am hoping to get the review column to be something like:

review = c('masterpiece work art',
           'sporting best writing voice work',
           'better possible way compared')
You can use the tidytext package for this:
library(tidytext)
library(dplyr)

test_data %>%
  unnest_tokens(review, review) %>%
  anti_join(stop_words, by = c("review" = "word"))
#     review_id      review score
# 1.2         1 masterpiece    90
# 1.6         1         art    90
# 2           2    sporting   100
# 2.5         2     writing   100
# 2.7         2       voice   100
# 3.6         3    compared   100
To get the words back into one row per review, you could do:
test_data %>%
  unnest_tokens(review, review) %>%
  anti_join(stop_words, by = c("review" = "word")) %>%
  group_by(review_id, score) %>%
  summarise(review = paste0(review, collapse = ' '))
#   review_id score review
#       &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt;
# 1         1    90 masterpiece art
# 2         2   100 sporting writing voice
# 3         3   100 compared
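Since the question's pipeline already uses text2vec, it is also worth noting that text2vec can drop stop words itself at vocabulary-building time, so nothing needs to happen to the raw text at all. This is a sketch, assuming tidytext's stop_words as the word list and swapping hash_vectorizer() for vocab_vectorizer() (a hashed vectorizer has no vocabulary to filter):

```r
library(text2vec)
library(tidytext)   # only for its stop_words data frame

# test_data as defined in the question
test_data <- data.frame(review_id = c(1, 2, 3),
                        review = c('is a masterpiece a work of art',
                                   'sporting some of the best writing and voice work',
                                   'better in every possible way when compared'),
                        score = c(90, 100, 100))

tokens <- word_tokenizer(test_data$review)

# create_vocabulary() takes a stopwords argument;
# those terms never enter the vocabulary
vocab <- create_vocabulary(itoken(tokens), stopwords = stop_words$word)

# build the DTM from the filtered vocabulary instead of a hash
document_term_matrix <- create_dtm(itoken(tokens), vocab_vectorizer(vocab))
```

The rest of the TF-IDF pipeline should work unchanged on this matrix.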
It turns out that I ended up solving my own problem.
I created the following function:
remove_words_from_text <- function(text) {
  # split on spaces, drop the unwanted words, and glue the rest back together
  text <- unlist(strsplit(text, " "))
  paste(text[!text %in% words_to_remove], collapse = " ")
}
And called it via lapply, unlisting the result so the column stays a character vector rather than a list:

words_to_remove <- stop_words$word
test_data$review <- unlist(lapply(test_data$review, remove_words_from_text))
Here's hoping that helps those who have the same problem that I did.
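A small footnote on the lapply approach: the same removal can also be done in a single vectorized call by collapsing words_to_remove into one alternation regex. This is a sketch rather than the answer's own method; the \b word boundaries stop partial matches (so 'art' is not stripped out of a word like 'artist'):

```r
library(tidytext)   # for stop_words

words_to_remove <- stop_words$word

# one pattern of the form \b(word1|word2|...)\b
pattern <- paste0('\\b(', paste(words_to_remove, collapse = '|'), ')\\b')

reviews <- c('is a masterpiece a work of art',
             'sporting some of the best writing and voice work')

# delete the stop words, then collapse the leftover whitespace
cleaned <- gsub(pattern, '', reviews)
cleaned <- trimws(gsub('\\s+', ' ', cleaned))
```

The function-plus-lapply version is easier to read; the regex version avoids the per-row loop, which can matter on large corpora.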