Removing stop words from an R data frame column

Question

Here's the situation: the solution seemed simple at first, but it has turned out to be more complicated than I expected.

I have an R data frame with three columns: an ID, a column of texts (reviews), and a column of numeric values that I want to predict from the text.

I have already done some preprocessing on the text column, so it is free of punctuation, in lower case, and ready to be tokenized and turned into a matrix so I can train a model on it. The problem is that I can't figure out how to remove the stop words from that text.

Here's what I am trying to do with the text2vec package. At first I was planning to do the stop-word removal before this chunk, but anywhere will do.

library(text2vec)

test_data <- data.frame(review_id = c(1, 2, 3),
                        review = c('is a masterpiece a work of art',
                                   'sporting some of the best writing and voice work',
                                   'better in every possible way when compared'),
                        score = c(90, 100, 100),
                        stringsAsFactors = FALSE)  # keep review as character (it becomes a factor by default before R 4.0)

# Tokenize, build a hashed document-term matrix, then apply TF-IDF weighting
tokens <- word_tokenizer(test_data$review)
document_term_matrix <- create_dtm(itoken(tokens), hash_vectorizer())
model_tfidf <- TfIdf$new()
document_term_matrix <- model_tfidf$fit_transform(document_term_matrix)

document_term_matrix <- as.matrix(document_term_matrix)

I am hoping to get the review column to be something like:

review=c('masterpiece work art',
         'sporting best writing voice work',
         'better possible way compared')

Answer 1

You can use the tidytext package for this:

library(tidytext)
library(dplyr)

test_data %>%
  unnest_tokens(review, review) %>%
  anti_join(stop_words, by= c("review" = "word"))

#    review_id      review score
#1.2         1 masterpiece    90
#1.6         1         art    90
#2           2    sporting   100
#2.5         2     writing   100
#2.7         2       voice   100
#3.6         3    compared   100

To get the words back into one row per review, you could do:

test_data %>%
  unnest_tokens(review, review) %>%
  anti_join(stop_words, by= c("review" = "word")) %>%
  group_by(review_id, score) %>%
  summarise(review = paste0(review, collapse = ' '))

#  review_id score review                
#      <dbl> <dbl> <chr>                 
#1         1    90 masterpiece art       
#2         2   100 sporting writing voice
#3         3   100 compared
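
One thing to watch for: a review made up entirely of stop words loses all of its rows in the anti_join, so it disappears from the summarised result. Below is a minimal sketch of one way to keep such rows, assuming an empty string is an acceptable placeholder. The second review and the names `reviews`, `cleaned`, and `result` are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical data: the second review contains nothing but stop words
reviews <- data.frame(review_id = c(1, 2),
                      review = c("is a masterpiece a work of art",
                                 "it is what it is"),
                      score = c(90, 50),
                      stringsAsFactors = FALSE)

# Same tokenize / anti_join / collapse steps as above
cleaned <- reviews %>%
  unnest_tokens(word, review) %>%
  anti_join(stop_words, by = "word") %>%
  group_by(review_id) %>%
  summarise(review = paste(word, collapse = " "), .groups = "drop")

# Join back onto the original ids so reviews with no remaining words
# are kept, with an empty string instead of NA
result <- reviews %>%
  select(review_id, score) %>%
  left_join(cleaned, by = "review_id") %>%
  mutate(review = coalesce(review, ""))

result
```

Whether an empty string, NA, or dropping the row is the right choice depends on what the downstream model expects.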

Answer 2

It turns out that I ended up solving my own problem.

I created the following function:

remove_words_from_text <- function(text) {
  text <- unlist(strsplit(text, " "))
  paste(text[!text %in% words_to_remove], collapse = " ")
}

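And called it on each review. (lapply works too, but it leaves review as a list column; vapply keeps it a character vector.)

library(tidytext)  # provides the stop_words data frame
words_to_remove <- stop_words$word
test_data$review <- vapply(as.character(test_data$review), remove_words_from_text,
                           character(1), USE.NAMES = FALSE)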
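
Since the question said the removal could happen anywhere, another option is to drop the stop words inside the text2vec pipeline itself, when the vocabulary is built. This is only a sketch: the short stop list below is a hypothetical stand-in (stop_words$word from tidytext could be passed instead), and it swaps the hash_vectorizer() from the question for a vocabulary-based vectorizer, because the stopwords argument belongs to create_vocabulary().

```r
library(text2vec)

# Hypothetical short stop list, for illustration only
words_to_remove <- c("is", "a", "of", "the", "in", "and", "some", "when", "every")

reviews <- c("is a masterpiece a work of art",
             "sporting some of the best writing and voice work",
             "better in every possible way when compared")

# Iterator over tokenized reviews
it <- itoken(reviews, tokenizer = word_tokenizer, progressbar = FALSE)

# Stop words are excluded while the vocabulary is built
vocab <- create_vocabulary(it, stopwords = words_to_remove)

# The document-term matrix only has columns for the remaining terms
dtm <- create_dtm(it, vocab_vectorizer(vocab))

# TF-IDF weighting, as in the question
dtm_tfidf <- TfIdf$new()$fit_transform(dtm)

colnames(dtm)
```

This leaves test_data$review untouched and filters only the matrix columns, which may be all that is needed if the cleaned text is not used anywhere else.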

Here's hoping that helps those who have the same problem I did.

Answer 1: 2, accepted, 2020-12-22 01:57:32
Answer 2: 0, 2020-12-22 00:59:31
