R: remove multiple text strings in a data frame

New to R. I am looking to remove certain words from a data frame. Since there are multiple words, I would like to define this list of words as a character vector and use gsub to remove them, then convert the result back to a data frame that keeps the same structure.

wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")

a
id                text time      username          
 1     "ai and x"        10     "me"          
 2     "and computing"   5      "you"         
 3     "nothing"         15     "everyone"     
 4     "ibm privacy"     0      "know"        

I was thinking something like:

a2 <- apply(a, 1, gsub(wordstoremove, "", a)

and then converting back to a data frame, but clearly this doesn't work.

wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")

(dat <- read.table(header = TRUE, text = 'id text time username
1 "ai and x" 10 "me"
2 "and computing" 5 "you"
3 "nothing" 15 "everyone"
4 "ibm privacy" 0 "know"'))

#   id          text time username
# 1  1      ai and x   10       me
# 2  2 and computing    5      you
# 3  3       nothing   15 everyone
# 4  4   ibm privacy    0     know

(dat1 <- as.data.frame(sapply(dat, function(x) 
  gsub(paste(wordstoremove, collapse = '|'), '', x))))

#   id    text time username
# 1  1   and x   10       me
# 2  2    and     5      you
# 3  3 nothing   15 everyone
# 4  4            0     know
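Note that sapply() returns a character matrix, so the as.data.frame(sapply(...)) approach above turns id and time into character columns (or factors in R versions before 4.0). If keeping the original column types matters, one possible variant is to touch only the character columns. This is a minimal sketch, not part of the original answer; the names dat2, pattern and is_chr are made up for illustration, and the data frame is rebuilt with data.frame() so the block runs on its own.

```r
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")

# Same sample data as above, built directly so the example is self-contained
dat <- data.frame(
  id = 1:4,
  text = c("ai and x", "and computing", "nothing", "ibm privacy"),
  time = c(10, 5, 15, 0),
  username = c("me", "you", "everyone", "know"),
  stringsAsFactors = FALSE
)

# One alternation pattern: "ai|computing|ulitzer|ibm|privacy|cognitive"
pattern <- paste(wordstoremove, collapse = "|")

# Apply gsub() only to character columns; numeric columns stay numeric
dat2 <- dat
is_chr <- vapply(dat2, is.character, logical(1))
dat2[is_chr] <- lapply(dat2[is_chr], function(x) gsub(pattern, "", x))

str(dat2)
```

The removed words leave stray spaces behind (for example " and x"); wrapping the gsub() call in trimws() is one way to tidy that up.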

Another option using dplyr::mutate() and stringr::str_remove_all():

library(dplyr)
library(stringr)

dat <- dat %>%   
  mutate(text = str_remove_all(text, regex(str_c("\\b",wordstoremove, "\\b", collapse = '|'), ignore_case = T)))

Because lowercase 'ai' could easily be part of a longer word, each word to remove is wrapped in \\b (a word boundary), so that it is only matched as a whole word and is not removed from the beginning, middle, or end of other words.

The search pattern is also wrapped with regex(pattern, ignore_case = T) in case some words are capitalized in the text string.
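To see what the word boundaries and case-insensitive matching change, here is a small base R comparison. It is only an illustrative sketch: the sample strings in x and the names loose and strict are invented for this example and do not come from the original data.

```r
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")

# Hypothetical sample text containing "ai" inside longer words
x <- c("AI is said to explain rain", "IBM and ibmers")

# Plain alternation: matches the words anywhere, case-sensitively
loose <- paste(wordstoremove, collapse = "|")

# Each word wrapped in \\b so only whole words match
strict <- paste0("\\b", wordstoremove, "\\b", collapse = "|")

loose_out <- gsub(loose, "", x)
strict_out <- gsub(strict, "", x, ignore.case = TRUE)

print(loose_out)   # "AI is sd to expln rn"  "IBM and ers"
print(strict_out)  # " is said to explain rain"  " and ibmers"
```

The loose pattern damages "said", "explain" and "rain" while missing the capitalized "AI" and "IBM"; the bounded, case-insensitive pattern removes only the standalone words.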

str_replace_all() could be used if you wanted to replace the words with something else rather than just removing them. str_remove_all() is just an alias for str_replace_all(string, pattern, '').
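As a rough illustration of the difference, the sketch below runs both functions on the same vector. The placeholder "[removed]" and the variable names are arbitrary choices for this example, and the last element is capitalized on purpose to exercise ignore_case.

```r
library(stringr)

wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")

# Sample strings modeled on the text column; "IBM" is capitalized deliberately
text <- c("ai and x", "and computing", "nothing", "IBM privacy")

# Whole-word, case-insensitive pattern built the same way as in the answer
pattern <- regex(str_c("\\b", wordstoremove, "\\b", collapse = "|"),
                 ignore_case = TRUE)

# Replace each matched word with a placeholder instead of deleting it
replaced <- str_replace_all(text, pattern, "[removed]")

# Deleting is the same as replacing with an empty string
removed <- str_remove_all(text, pattern)

print(replaced)
print(removed)
```

If the leftover whitespace after removal is a problem, stringr::str_squish() can collapse and trim it.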

rawr's answer could be updated to:

dat1 <- as.data.frame(sapply(dat, function(x) 
  gsub(paste0('\\b', wordstoremove, '\\b', collapse = '|'), '', x, ignore.case = T)))
