How can I remove repeated characters in a string with R?

I would like to implement a function in R that removes repeated characters in a string. For instance, say my function is named removeRS; it is supposed to work this way:

  removeRS('Buenaaaaaaaaa Suerrrrte')
  Buena Suerte
  removeRS('Hoy estoy tristeeeeeee')
  Hoy estoy triste

My function is going to be used with strings written in Spanish, where it is not common (or at least not correct) to find words with three or more successive identical vowels. Don't worry about the possible sentiment behind them. There are, however, words that legitimately contain two successive identical consonants (especially ll and rr), so the function should leave doubled letters alone.

So, to sum up, this function should replace any letter that appears at least three times in a row with a single instance of that letter. In one of the examples above, aaaaaaaaa is replaced with a.

Could you give me any hints for carrying out this task in R?

I did not think very carefully about this, but here is my quick solution using backreferences in regular expressions:

gsub('([[:alpha:]])\\1+', '\\1', 'Buenaaaaaaaaa Suerrrrte')
# [1] "Buena Suerte"

() captures a letter first, \\1 refers to that letter, and + means to match it once or more; putting these pieces together, the pattern matches a letter that occurs two or more times in a row.

To include other characters besides letters, replace [[:alpha:]] with a regex matching whatever you wish to include.
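As a minimal sketch of that substitution, the class below restricts the collapse to lowercase unaccented vowels. The name collapse_vowels is hypothetical and not part of the answer above.

```r
# Hypothetical helper: collapse runs of the same lowercase vowel only.
collapse_vowels <- function(x) {
  # ([aeiou]) captures one vowel; \\1+ matches one or more further copies of it.
  gsub("([aeiou])\\1+", "\\1", x)
}

collapse_vowels("Buenaaaaaaaaa Suerrrrte")
# [1] "Buena Suerrrrte"
```

The run of r is left untouched because r is not in the character class.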

I think you should pay attention to the ambiguities in your problem description. This is a first stab, but it clearly does not work with "Good Luck" in the manner you desire (it collapses every run, so it would return "God Luck"):

removeRS <- function(str) paste(rle(strsplit(str, "")[[1]])$values, collapse="")
removeRS('Buenaaaaaaaaa Suerrrrte')
#[1] "Buena Suerte"
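One way to keep the rle approach while respecting the "at least three in a row" rule is to shorten only the long runs and then decode. This is a sketch under that assumption; the name removeRS_rle and the min_run parameter are hypothetical, and the function handles a single string at a time.

```r
# Hypothetical variant: collapse a run only when it is at least `min_run` long.
removeRS_rle <- function(str, min_run = 3) {
  runs <- rle(strsplit(str, "")[[1]])      # run-length encode the characters
  long <- runs$lengths >= min_run          # runs long enough to collapse
  runs$lengths[long] <- 1L                 # keep a single copy of those
  paste(inverse.rle(runs), collapse = "")  # decode and glue back together
}

removeRS_rle("Good Luuuuck")
# [1] "Good Luck"
```

The double o survives because its run length is 2, below the threshold.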

Since you want to replace letters that appear AT LEAST 3 times, here is my solution:

gsub("([[:alpha:]])\\1{2,}", "\\1", "Buennaaaa Suerrrtee")
#[1] "Buenna Suertee"
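Wrapped up as the function the question asks for, this could look like the sketch below. The min_run parameter is a hypothetical addition for illustration; with the default of 3 the pattern is the same as the one above.

```r
# Sketch of the requested function; `min_run` is a hypothetical parameter.
removeRS <- function(x, min_run = 3) {
  stopifnot(min_run >= 2)
  # After the captured letter, require at least (min_run - 1) further copies.
  pattern <- sprintf("([[:alpha:]])\\1{%d,}", min_run - 1L)
  gsub(pattern, "\\1", x)
}

removeRS(c("Buenaaaaaaaaa Suerrrrte", "Hoy estoy tristeeeeeee"))
# [1] "Buena Suerte"     "Hoy estoy triste"
```

Because gsub is vectorized, the function accepts a character vector and returns one cleaned string per element.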

As you can see, the four a's have been reduced to a single a and the three r's to a single r, but the two n's and the two e's have not been changed. As suggested above, you can replace [[:alpha:]] with any character class, such as [a-zA-KM-Z] or similar. To affect only repetitions of particular letters, list them inside the square brackets, for example [yQ] for y and Q. The "or" operator | is not needed there: inside square brackets it is treated as a literal | character.

gsub("([ae])\\1{2,}", "\\1", "Buennaaaa Suerrrtee")
# [1] "Buenna Suerrrtee"
# The triple r is not affected, and there is no triple e.
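One detail worth checking when writing such classes: inside square brackets, | is an ordinary character rather than alternation. The small sketch below compares a class with and without it; the input string and variable names are made up for illustration.

```r
# A run of "|" sits between a run of "a" and a run of "e".
x <- "aaa|||eee"

with_pipe    <- gsub("([a|e])\\1{2,}", "\\1", x)  # class contains "a", "|" and "e"
without_pipe <- gsub("([ae])\\1{2,}", "\\1", x)   # class contains only "a" and "e"

print(c(with_pipe, without_pipe))
# [1] "a|e"   "a|||e"
```

The first pattern also collapses the run of | characters, which is usually not what is intended.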
