
R: Array-based replacement of string matches in a data frame

I have a data frame column containing sentences. Within these sentences, there is a whole host of words that I want to remove.

These words can appear more than once in a single sentence, and wherever they are found I want to remove them entirely.

E.g., a sample list of words for removal: ("the", "and", "a") (the real list will have hundreds of words).

String before: "the quick brown fox jumps over the lazy dog and cat"
String after: "quick brown fox jumps over lazy dog cat"


 sentences <- as.data.frame(c("it's a new sentence","another sentence i've constructed","and a third sentence"))
 colnames(sentences) <- c("sentence")

 stop_words <- list("i" = "", "a" = "", "me" = "", "my" = "", "myself" = "", "we" = "", "it's" = "", "a" = "", "i've" = "")

 stop_pattern <- paste0("\\b", "(", paste0(stop_words, collapse = "|"),")","\\b")
 trimws(gsub("\\s{2}", " ", gsub(stop_pattern, "", sentences$sentence)))

The output should have words such as "i've" removed from the sentences above, but the code fails to do so.

The output is as follows:

 [1] "it's a new sentence" "another sentence i've constructed" "and a third sentence"
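A quick way to see why nothing is removed is to look at what actually goes into the pattern. The sketch below uses a shortened, illustrative version of the stop-word list (the variable names `values_joined` and `names_joined` are made up for this example). It shows that `paste0()` on a named list joins the list values, which are all empty strings here, rather than the names that hold the words.

```r
# Shortened stop-word list in the same shape as the one in the question:
# the words are the names, the values are empty strings.
stop_words <- list("i" = "", "a" = "", "it's" = "", "i've" = "")

# paste0() works on the values of the list, so only separators remain.
values_joined <- paste0(stop_words, collapse = "|")

# names() returns the words themselves.
names_joined <- paste0(names(stop_words), collapse = "|")

print(values_joined)  # "|||"
print(names_joined)   # "i|a|it's|i've"
```

With only empty alternatives in the group, the pattern has nothing to match against, which is consistent with the unchanged output shown above.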

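Try:

 # The words are the names of the list (its values are empty strings), so build the pattern from names()
 stop_pattern <- paste0("\\b", "(", paste0(names(stop_words), collapse = "|"), ")", "\\b")
 # Collapse runs of two or more spaces left behind by the removals
 trimws(gsub("\\s{2,}", " ", gsub(stop_pattern, "", sentences$sentence)))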
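Since `\\b` treats the apostrophe in words like "i've" as a word boundary, one possible alternative is to match whole whitespace-delimited tokens instead. The following is only a sketch under some assumptions: `remove_words` is a hypothetical helper name, the stop words are passed as a plain character vector, they contain no regex metacharacters, and words are separated by whitespace (a word followed directly by punctuation would not be removed).

```r
# Hypothetical helper: remove whole whitespace-delimited words from each string.
remove_words <- function(x, words) {
  words <- unique(words)  # drop duplicates such as a repeated "a"
  # (?<!\S) and (?!\S) require the match to be surrounded by whitespace
  # or the start/end of the string; lookarounds need perl = TRUE.
  pattern <- paste0("(?<!\\S)(?:", paste0(words, collapse = "|"), ")(?!\\S)")
  out <- gsub(pattern, "", x, perl = TRUE)
  # Collapse the leftover runs of whitespace and trim the ends.
  trimws(gsub("\\s+", " ", out))
}

sentences <- data.frame(
  sentence = c("it's a new sentence",
               "another sentence i've constructed",
               "and a third sentence"),
  stringsAsFactors = FALSE
)
stop_words <- c("i", "a", "me", "my", "myself", "we", "it's", "i've")

cleaned <- remove_words(sentences$sentence, stop_words)
print(cleaned)
# [1] "new sentence" "another sentence constructed" "and third sentence"
```

For a list of hundreds of words this still builds a single pattern, so each sentence is scanned once for the removals.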

