
(How) can multiple backreferences be used in alternation patterns?

This question is a spin-off from the question "Function to count of consecutive digits in a string vector".

Assume I have strings such as x:

library(stringr)  # provides str_detect()

x <- c("555123", "57333", "21112", "12345", "22144", "44440")

and want to detect those strings in which any digit between 2 and 5 occurs, in immediate succession, as many times as its own value. That is, match if the string contains 22, 333, 4444, or 55555.

If I approach this task in small chunks using backreference, everything is fine:

str_detect(x, "(2)\\1{1}")
[1] FALSE FALSE FALSE FALSE  TRUE FALSE

str_detect(x, "(3)\\1{2}")
[1] FALSE  TRUE FALSE FALSE FALSE FALSE

str_detect(x, "(4)\\1{3}")
[1] FALSE FALSE FALSE FALSE FALSE  TRUE

However, if I pursue a single solution for all matches using a vector with the allowed numbers:

digits <- 2:5

and an alternation pattern, such as this:

patt <- paste0("(", digits, ")\\1{", digits - 1, "}", collapse = "|")
patt
[1] "(2)\\1{1}|(3)\\1{2}|(4)\\1{3}|(5)\\1{4}"

and pass patt to str_detect, only the first alternative, namely (2)\\1{1}, is detected:

str_detect(x, patt)
[1] FALSE FALSE FALSE FALSE  TRUE FALSE

Is it the backreference which cannot be used in alternation patterns? If so, then why does a for loop iterating through each option separately not work either?

res <- c()
for(i in 2:5){
  res <- str_detect(x, paste0("(", i, ")\\1{", i - 1, "}"))
}
res
[1] FALSE FALSE FALSE FALSE FALSE FALSE

Advice on this matter is greatly appreciated!

What about this?

> grepl(
+   paste0(sapply(2:5, function(i) sprintf("(%s)\\%s{%s}", i, i - 1, i - 1)), collapse = "|"),
+   x
+ )
[1] FALSE  TRUE FALSE FALSE  TRUE  TRUE

or

> rowSums(sapply(2:5, function(i) grepl(sprintf("(%s)\\1{%s}", i, i - 1), x))) > 0
[1] FALSE  TRUE FALSE FALSE  TRUE  TRUE

As mentioned in the comments, you need to update the regex so that each alternative refers back to its own capture group rather than always to group 1:

patt = paste0(
  "(", digits, ")\\", digits - 1, "{", digits - 1, "}", 
  collapse = "|"
)
str_detect(x, patt)

Output:

[1] FALSE  TRUE FALSE FALSE  TRUE  TRUE

In your for loop, you overwrite res on every iteration, so when you print res at the end you only see the result for i = 5, and no string contains 55555. If you use print() instead:

for(i in 2:5){
  print(str_detect(x, paste0("(", i, ")\\1{", i - 1, "}")))
}

Output:

[1] FALSE FALSE FALSE FALSE  TRUE FALSE
[1] FALSE  TRUE FALSE FALSE FALSE FALSE
[1] FALSE FALSE FALSE FALSE FALSE  TRUE
[1] FALSE FALSE FALSE FALSE FALSE FALSE
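
To keep the for loop but get one combined answer, the per-digit results can be accumulated with a logical OR instead of being overwritten. The following is a minimal sketch of that idea. It uses base R's grepl() with perl = TRUE in place of str_detect() so that it runs without extra packages, which is a substitution on my part and not what the question used.

```r
x <- c("555123", "57333", "21112", "12345", "22144", "44440")

# Start with all FALSE, then OR in the matches for each digit in turn
res <- logical(length(x))
for (i in 2:5) {
  # Each pattern has exactly one capture group, so \\1 is the right reference here
  pattern <- paste0("(", i, ")\\1{", i - 1, "}")
  res <- res | grepl(pattern, x, perl = TRUE)
}
res
# [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE
```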

If you wanted to iterate over the digits explicitly, you could use map_lgl() from purrr instead of a for loop:

map_lgl(x, function(str) {
  any(map_lgl(
    2:5, 
    ~ str_detect(str, paste0("(", .x, ")\\1{", .x - 1, "}"))
  ))
})

Output:

[1] FALSE  TRUE FALSE FALSE  TRUE  TRUE
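
Because the digits are known in advance, the same result can probably also be had without any backreference by spelling the runs out literally. This is a different technique from the one asked about, shown only as a hedged alternative; the names runs, literal_patt and hits are made up for this sketch.

```r
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
digits <- 2:5

# strrep() is vectorised over both arguments: digit d is repeated d times
runs <- strrep(digits, digits)

# Plain alternation of literal strings, no capture groups involved
literal_patt <- paste(runs, collapse = "|")
hits <- grepl(literal_patt, x)
hits
# [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE
```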

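In your pattern (2)\\1{1}|(3)\\1{2}|(4)\\1{3}|(5)\\1{4}, capture groups are numbered from left to right across the whole pattern, so \\1 always refers to the first group, (2), in every alternative. When the second, third, or fourth alternative is tried, group 1 has not captured anything, so the backreference fails. That is why you only match the first alternative.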

Since every alternative adds its own capture group, make each backreference point to the group in the same alternative instead:

(2)\\1{1}|(3)\\2{2}|(4)\\3{3}|(5)\\4{4}

The (2)\\1{1} could be just (2)\\1, but the longer form is fine since you are assembling the pattern dynamically.
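
When building such a pattern programmatically, the group number depends on the position of the alternative, not on the digit itself. Using digits - 1 as the group number only works because digits happens to start at 2. Here is a sketch that derives the group number with seq_along(), again using base R's grepl() with perl = TRUE as an assumed stand-in for str_detect(); the names patt_by_position and other_patt are illustrative only.

```r
x <- c("555123", "57333", "21112", "12345", "22144", "44440")
digits <- 2:5

# The k-th alternative contains the k-th capture group, so number by position
patt_by_position <- paste0(
  "(", digits, ")\\", seq_along(digits), "{", digits - 1, "}",
  collapse = "|"
)
hits <- grepl(patt_by_position, x, perl = TRUE)
hits
# [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE

# The same construction still works when the digits do not start at 2
other <- c(3, 4)
other_patt <- paste0(
  "(", other, ")\\", seq_along(other), "{", other - 1, "}",
  collapse = "|"
)
other_hits <- grepl(other_patt, x, perl = TRUE)
other_hits
# [1] FALSE  TRUE FALSE FALSE FALSE  TRUE
```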
