R: remove multiple text strings in a data frame
New to R. I am looking to remove certain words from a data frame. Since there are multiple words, I would like to define the list of words as a character vector and use gsub to remove them, then convert the result back to a data frame that keeps the same structure.
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
a
id text time username
1 "ai and x" 10 "me"
2 "and computing" 5 "you"
3 "nothing" 15 "everyone"
4 "ibm privacy" 0 "know"
I was thinking of something like:
a2 <- apply(a, 1, gsub(wordstoremove, "", a)
but this clearly doesn't work, and the result would still need converting back to a data frame.
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
(dat <- read.table(header = TRUE, text = 'id text time username
1 "ai and x" 10 "me"
2 "and computing" 5 "you"
3 "nothing" 15 "everyone"
4 "ibm privacy" 0 "know"'))
# id text time username
# 1 1 ai and x 10 me
# 2 2 and computing 5 you
# 3 3 nothing 15 everyone
# 4 4 ibm privacy 0 know
(dat1 <- as.data.frame(sapply(dat, function(x)
gsub(paste(wordstoremove, collapse = '|'), '', x))))
# id text time username
# 1 1 and x 10 me
# 2 2 and 5 you
# 3 3 nothing 15 everyone
# 4 4 0 know
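One side effect of the sapply() approach above is that sapply() returns a character matrix, so every column of dat1 (including id and time) ends up as character rather than integer. If keeping the original column types matters, a minimal sketch is to apply gsub only to the character columns. The names pattern, is_chr and dat2 are illustrative, and stringsAsFactors = FALSE is passed explicitly on the assumption that the text columns should be character rather than factor:

```r
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")

dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = 'id text time username
1 "ai and x" 10 "me"
2 "and computing" 5 "you"
3 "nothing" 15 "everyone"
4 "ibm privacy" 0 "know"')

# One alternation pattern built from all words to remove
pattern <- paste(wordstoremove, collapse = "|")

# Find the character columns and rewrite only those, in place
is_chr <- vapply(dat, is.character, logical(1))
dat2 <- dat
dat2[is_chr] <- lapply(dat2[is_chr], function(x) gsub(pattern, "", x))

str(dat2)  # id and time are still integer; text and username are character
```

Assigning into dat2[is_chr] keeps the object a data frame, so no as.data.frame() conversion is needed afterwards.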
Another option, using dplyr::mutate() and stringr::str_remove_all():
library(dplyr)
library(stringr)
dat <- dat %>%
mutate(text = str_remove_all(text, regex(str_c("\\b",wordstoremove, "\\b", collapse = '|'), ignore_case = T)))
Because lowercase 'ai' could easily be part of a longer word, each word to remove is wrapped in \\b (a word boundary) so that it is not removed from the beginning, middle, or end of other words.
The search pattern is also wrapped with regex(pattern, ignore_case = T) in case some words are capitalized in the text string.
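To see what the word boundaries and case-insensitive matching change, here is a small base R sketch. The vector x is made-up sample input, not data from the question, and gsub is used instead of stringr only to keep the example dependency-free:

```r
x <- c("ai and rain", "AI is said")

# Without boundaries, "ai" is also stripped out of "rain" and "said"
no_boundary <- gsub("ai", "", x)
no_boundary
# " and rn"  "AI is sd"

# With \\b on both sides, only the standalone lowercase word is removed
with_boundary <- gsub("\\bai\\b", "", x)
with_boundary
# " and rain"  "AI is said"

# Adding ignore.case = TRUE also removes the capitalized "AI"
ignoring_case <- gsub("\\bai\\b", "", x, ignore.case = TRUE)
ignoring_case
# " and rain"  " is said"
```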
str_replace_all() could be used if you wanted to replace the words with something else rather than just removing them. str_remove_all() is just an alias for str_replace_all(string, pattern, '').
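As a rough illustration of the replace variant, the sketch below swaps each matched word for a placeholder. The sample vector x, the shortened word list and the "[removed]" placeholder are assumptions made for the example. The last line uses stringr::str_squish(), which is not part of the answer above, as one possible way to tidy the whitespace that removal leaves behind:

```r
library(stringr)

wordstoremove <- c("ai", "ibm", "privacy")
x <- c("ai and x", "IBM privacy")

# Same pattern construction as above: word boundaries plus case-insensitive matching
pattern <- regex(str_c("\\b", wordstoremove, "\\b", collapse = "|"), ignore_case = TRUE)

# Replace each match with a placeholder instead of deleting it
replaced <- str_replace_all(x, pattern, "[removed]")
replaced

# Remove the matches, then collapse leftover whitespace
removed <- str_squish(str_remove_all(x, pattern))
removed
```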
rawr's answer could be updated to:
dat1 <- as.data.frame(sapply(dat, function(x)
gsub(paste0('\\b', wordstoremove, '\\b', collapse = '|'), '', x, ignore.case = T)))