
Replacing words with spaces within a tibble in R without anti-join

I have a tibble of sentences like so:

# A tibble: 1,782 x 1
  Chat
  <chr>
1 Hi i would like to find out more about the trials
2 Hello I had a guest 
3 Hello my friend overseas right now
...

What I'm trying to do is remove stopwords such as "I" and "hello". I already have a list of them, and I want to replace each one with a space. I tried using mutate and gsub, but gsub only takes a regex pattern rather than a vector of words. An anti-join won't work here because I am building bigrams/trigrams, so I don't have a single-word column to anti-join against the stopwords.

Is there a way to replace all these words in each sentence in R?

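We could unnest the tokens, replace each 'word' that appears in the 'word' column of 'stop_words' with a space (" "), and paste the words back together after grouping by 'lines'. Note that unnest_tokens() lowercases the text by default (to_lower = TRUE), so the result comes back in lower case.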

library(tidytext)
library(tidyverse)
rowid_to_column(df1, 'lines') %>% 
     unnest_tokens(word, Chat) %>% 
     mutate(word = replace(word, word %in% stop_words$word, " ")) %>% 
     group_by(lines) %>% 
     summarise(Chat = paste(word, collapse=' ')) %>%
     ungroup %>%
     select(-lines)

NOTE: This replaces the stop words found in the 'stop_words' dataset with " ". If only a custom subset of stop words should be replaced, create a vector of those words and use it in the mutate step:

v1 <- c("I", "hello", "Hi")
rowid_to_column(df1, 'lines') %>%
  ...
  ...
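  # unnest_tokens() lowercases tokens by default, so compare against lowercased stop words
  mutate(word = replace(word, word %in% tolower(v1), " ")) %>%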
  ...
  ...
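
For comparison, here is a minimal base-R sketch of the same token-level idea (split, replace, paste back), in case pulling in tidytext is not an option. The function name replace_tokens and the chat vector are made up for illustration. It assumes words are separated by whitespace only and, unlike unnest_tokens(), it keeps the original case and punctuation of the words that remain.

```r
# Example sentences from the question (stand-in for the Chat column)
chat <- c("Hi i would like to find out more about the trials",
          "Hello I had a guest",
          "Hello my friend overseas right now")
v1 <- c("I", "hello", "Hi")

# Hypothetical helper: replace whole tokens that are stop words with " "
replace_tokens <- function(x, stopwords) {
  # Split each sentence on runs of whitespace
  tokens <- strsplit(x, "\\s+")
  vapply(tokens, function(tok) {
    # Compare in lower case so "Hi", "hi" and "HI" all match
    tok[tolower(tok) %in% tolower(stopwords)] <- " "
    paste(tok, collapse = " ")
  }, character(1))
}

cleaned <- replace_tokens(chat, v1)
```

On a tibble this could be applied as mutate(Chat = replace_tokens(Chat, v1)). Each replaced word leaves extra spaces behind, so wrap the result in trimws() or collapse repeated spaces if that matters for the n-grams.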

We can construct a pattern of the form "\\bstopword\\b" for each stop word, join the pieces with "|", and then use gsub to replace every match with " ". Here is an example. Note that I set ignore.case = TRUE to match both lower and upper case, but you may want to adjust that for your needs.

dat <- read.table(text = "Chat
                  1 'Hi i would like to find out more about the trials'
                  2 'Hello I had a guest' 
                  3 'Hello my friend overseas right now'",
                  header = TRUE, stringsAsFactors = FALSE)

dat
#                                                Chat
# 1 Hi i would like to find out more about the trials
# 2                               Hello I had a guest
# 3                Hello my friend overseas right now

# A vector of stop words
stopword <- c("I", "Hello", "Hi")
# Create the pattern
stopword2 <- paste0("\\b", stopword, "\\b")
stopword3 <- paste(stopword2, collapse = "|")

# View the pattern
stopword3
# [1] "\\bI\\b|\\bHello\\b|\\bHi\\b"

dat$Chat <- gsub(pattern = stopword3, replacement = " ", x = dat$Chat, ignore.case = TRUE)
dat
#                                               Chat
# 1     would like to find out more about the trials
# 2                                      had a guest
# 3                     my friend overseas right now
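
If this is needed in more than one place, the pattern-building steps can be wrapped in a small function. The sketch below is only one way to do it: the name replace_stopwords and its arguments are hypothetical, and it assumes the stop words contain only letters and digits (a stop word with regex metacharacters such as "." or "+" would need escaping first). It puts all stop words into one group inside a single pair of word boundaries, which is equivalent to the pattern above for such words, and it uses perl = TRUE so that the non-capturing group (?: ) is available.

```r
# Hypothetical helper wrapping the regex approach
replace_stopwords <- function(x, stopwords, replacement = " ", ignore_case = TRUE) {
  # Builds a pattern such as \b(?:I|Hello|Hi)\b
  pattern <- paste0("\\b(?:", paste(stopwords, collapse = "|"), ")\\b")
  gsub(pattern, replacement, x, ignore.case = ignore_case, perl = TRUE)
}

chat <- c("Hello I had a guest", "Hi this is about Hiking")
out <- replace_stopwords(chat, c("I", "Hello", "Hi"))

# Optional: collapse the leftover runs of spaces into single spaces
out_squished <- trimws(gsub("\\s+", " ", out))
```

The word boundaries are what keep "this", "is" and "Hiking" intact even though they contain "i" or "Hi". With a data frame, the call would be dat$Chat <- replace_stopwords(dat$Chat, stopword).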
