Replace the entire word containing a pattern - gsub and R

Question

I am trying to clean some garbage out of some text. While doing this, I am assuming that any word that has a letter (any letter) repeated three or more times is garbage, and I want to remove it.

I've come up with this:

gsub(pattern = "[a-zA-Z]\\1\\1", replacement = "", string)

in which string is the character vector, but this doesn't work. Everything else I've tried might find the pattern, but it just removes the pattern, leaving a mess. I'm trying to remove the whole word with the pattern in it.

Any ideas?

Answer 1

You need to turn the [a-zA-Z] character class into a "capture group" by wrapping it in parentheses, since the \\1 needs something to reference:

gsub("([a-zA-Z])\\1\\1", "", "aabbbccdddee")
# [1] "aaccee"
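To see why this alone still leaves "a mess" on real sentences, here is a small sketch that applies the same pattern to the sample sentence from Answer 2's demo. The variable name run_only is just illustrative. Only the three repeated letters are deleted, so the rest of each word stays behind:

```r
# Sample sentence borrowed from the demo in Answer 2
string <- "This is a baaaad unnnnecessary short word"

# The capture group plus two backreferences matches exactly three
# identical letters in a row, and only those three letters are removed
run_only <- gsub("([a-zA-Z])\\1\\1", "", string)
print(run_only)
# [1] "This is a bad unecessary short word"
```

Removing the whole word needs the pattern to also consume the letters around the repeated run.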

Answer 2

You need

gsub("\\s*[[:alpha:]]*([[:alpha:]])\\1{2}[[:alpha:]]*", "", string)
gsub("\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "", string, perl=TRUE)
stringr::str_replace_all(string, "\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "")

See an R demo:

string <- "This is a baaaad unnnnecessary short word"
gsub("\\s*[[:alpha:]]*([[:alpha:]])\\1{2}[[:alpha:]]*", "", string)
gsub("\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "", string, perl=TRUE)
library(stringr)
str_replace_all(string, "\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "")

All yielding [1] "This is a short word".

See the regex demo. Regex details:

\s* - zero or more whitespace characters
\p{L}* / [[:alpha:]]* - zero or more letters
(\p{L}) - Capturing group 1: any single letter
\1{2} - two occurrences of the same value as in Group 1
\p{L}* / [[:alpha:]]* - zero or more letters.
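One thing the pattern above does not cover: when the garbage word is the first word, there is no whitespace in front of it to consume, so a leading space is left behind. A minimal sketch of a wrapper that handles this, assuming a hypothetical helper name remove_repeat_words and a hypothetical parameter n for the repeat threshold (neither is part of the answer):

```r
# Hypothetical helper: drop every word in which some letter occurs
# at least n times in a row, then tidy the outer whitespace
remove_repeat_words <- function(x, n = 3) {
  stopifnot(n >= 2)
  # The captured letter must be followed by n - 1 more copies of itself
  pattern <- sprintf("\\s*\\p{L}*(\\p{L})\\1{%d}\\p{L}*", n - 1)
  cleaned <- gsub(pattern, "", x, perl = TRUE)
  # A garbage word at the very start leaves a leading space; trim it
  trimws(cleaned)
}

result <- remove_repeat_words("baaaad This is a short unnnnecessary word")
print(result)
# [1] "This is a short word"
```

The trimws() call is an addition on top of the answer's regex; without it the result here would start with a space.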

Answer 3

r2evans' example with a different regex:

gsub("(\\w)\\1{2,}", "", "aabbbccdddee")

[1] "aaccee"
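The two patterns differ once a letter repeats more than three times. A quick sketch of the difference, using perl = TRUE only so that the behaviour is that of PCRE (that flag and the variable names are choices made for this illustration, not something from the answers):

```r
x <- "baaaad"

# Exactly three in a row: one "a" of the four survives
fixed_three <- gsub("(\\w)\\1\\1", "", x, perl = TRUE)

# Three or more in a row: the whole run of "a" is removed
three_or_more <- gsub("(\\w)\\1{2,}", "", x, perl = TRUE)

print(fixed_three)
# [1] "bad"
print(three_or_more)
# [1] "bd"
```

Either way, only the repeated run is removed here, not the whole word.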


Answer 1
1 2022-01-07 22:11:44

Answer 2
1 2022-01-07 22:30:44

Answer 3
0 2022-01-07 22:21:15
