
R regular expression match/omit several repeats

I'm using backreferences to get rid of accidental repeats in vectors of variable names. The names in the first case I encountered have repeat patterns like this:

x <- c("gender_gender-1", "county_county-2", "country_country-1997",
       "country_country-1993")

The repeats were always separated by an underscore, there was only one repeat to eliminate, and the repeat always started at the beginning of the text. After checking the Regular Expressions Cookbook, 2nd ed., I arrived at an answer that works:

> gsub("^(.*?)_\\1", "\\1", x)
[1] "gender-1"     "county-2"     "country-1997" "country-1993"

I was worried that future cases might use a dash, space, or comma as the separator, so I wanted to generalize the matching a bit. I got that worked out as well.

> x <- c("gender_gender-1", "county-county-2", "country country-1997",
+       "country,country-1993")
> gsub("^(.*?)[,_ -]\\1", "\\1", x)
[1] "gender-1"     "county-2"     "country-1997" "country-1993"

So far, total victory.

Now, what is the correct fix if there are three repeats in some cases? In this one, I want "county-county-county" to become just one "county".

> x <- c("gender_gender-1", "county-county-county-2")
> gsub("^(.*?)[,_ -]\\1", "\\1", x)
[1] "gender-1"        "county-county-2"

I am willing to replace all of the separators by "_" if that makes it easier to get rid of the repeat words.
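To see why one repeat survives, it may help to look at what the single-repeat pattern actually consumes. The sketch below uses regmatches() and regexpr() from base R. Two things here are assumptions rather than part of the original code: perl = TRUE is set so that PCRE handles the lazy quantifier and the backreference, and the separator class is written with a plain space.

```r
x <- c("gender_gender-1", "county-county-county-2")

# Single-repeat pattern: Group 1, one separator, one copy of Group 1
single <- "^(.*?)[,_ -]\\1"

# Extract the text the pattern matches in each element
matched <- regmatches(x, regexpr(single, x, perl = TRUE))
print(matched)

# Only the matched prefix is replaced by Group 1; the rest is kept as-is
result <- gsub(single, "\\1", x, perl = TRUE)
print(result)
```

In the second element the match stops after "county-county", so the third "-county" is left in place.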

You may quantify the [,_ -]\\1 part:

gsub("^(.*?)(?:[,_\\s-]\\1)+", "\\1", x, perl=TRUE)


Note that I also replaced the space with \\s to match any whitespace (this requires perl=TRUE). You may also match any whitespace with [:space:] inside the bracket expression, in which case you do not need perl=TRUE, i.e. gsub("^(.*?)(?:[,_[:space:]-]\\1)+", "\\1", x).
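As a small illustration of what switching from a literal space to \\s changes, the sketch below runs both separator classes over a made-up vector (not from the question) in which one repeat is separated by a tab. Both calls use perl = TRUE so that only the separator class differs.

```r
# Hypothetical input: the second element uses a tab as the separator
x <- c("country country-1997", "country\tcountry-1993")

# Literal space in the class: the tab-separated repeat is not matched
with_space <- gsub("^(.*?)(?:[,_ -]\\1)+", "\\1", x, perl = TRUE)
print(with_space)

# \\s in the class: space, tab and other whitespace count as separators
with_ws <- gsub("^(.*?)(?:[,_\\s-]\\1)+", "\\1", x, perl = TRUE)
print(with_ws)
```

With the literal space the second element comes back unchanged, while the \\s version collapses both elements.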

Details:

  • ^ - matches the start of the string
  • (.*?) - Group 1: any 0+ chars, as few as possible, up to the first...
  • (?: - start of a non-capturing group that matches
    • [,_\\s-] - a comma, _ , whitespace or -
    • \\1 - the same value as captured in Group 1
  • )+ - end of the group, repeated 1 or more times.
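The breakdown above can be checked against the vectors from the question. This is a minimal sketch; the variable names pattern and cleaned are only for illustration.

```r
x <- c("gender_gender-1", "county-county-county-2", "country,country-1993")

# Group 1 followed by one or more "separator + copy of Group 1" units
pattern <- "^(.*?)(?:[,_\\s-]\\1)+"

# The whole run of repeats is replaced by a single copy of Group 1
cleaned <- gsub(pattern, "\\1", x, perl = TRUE)
print(cleaned)
# [1] "gender-1"     "county-2"     "country-1993"
```

Because the separator class is evaluated on each repetition, a run with mixed separators such as "county_county-county-2" should collapse the same way.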

If you only want to match the repeated part 1 or 2 times, replace + with the {1,2} limiting quantifier:

gsub("^(.*?)(?:[,_\\s-]\\1){1,2}", "\\1", x, perl=TRUE)
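The difference between the two quantifiers only shows up once there are more than two extra copies. The sketch below uses a made-up input with four copies of the word (not from the question) to compare them.

```r
# Hypothetical input: the word appears four times (three repeats)
x <- "county-county-county-county-2"

# "+" consumes every repeat that follows Group 1
unbounded <- gsub("^(.*?)(?:[,_\\s-]\\1)+", "\\1", x, perl = TRUE)
print(unbounded)
# [1] "county-2"

# "{1,2}" consumes at most two repeats, so one extra copy remains
bounded <- gsub("^(.*?)(?:[,_\\s-]\\1){1,2}", "\\1", x, perl = TRUE)
print(bounded)
# [1] "county-county-2"
```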

