简体   繁体   中英

Extract text between variable delimiters

I have text with large numbers of special characters that I want to extract certain substrings from:

y <- c("some stuff <rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep> some stuff <#> <dir> where is Londonderry?</dir>",
       "some stuff <rep> <[> But it 's 1nOt an overflow of Belfast% </rep> <#> potentially more stuff <rep> I 1lIved in Lisburn </rep>",
       "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa> blah blub <icu> Yeah </icu>")

I want to extract whatever comes between 'tag' like substrings, either <dir>...</dir> or <rep>...</rep> or <icu>...</icu> and so on:

With this regex I'm modestly successful:

library(stringr)
lapply(y, function(x) paste0(unlist(str_extract_all(x, "<([a-z]{3})>(?!<\\1>).*</\\1>")), collapse = ", "))
[[1]]
[1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"

[[2]]
[1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep> <#> potentially more stuff <rep> I 1lIved in Lisburn </rep>"

[[3]]
[1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"

Just [[2]] isn't as expected: there is still unwanted material (namely <#> potentially more stuff ) and the two occurrences of <rep>...</rep> substrings are not separated by , . My hunch is that my regex fails here because the two tags are the same rather than different.

How can the regex be improved so that this expected result is obtained:

Expected result :

[[1]]
[1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"

[[2]]
[1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>, <rep> I 1lIved in Lisburn </rep>"

[[3]]
[1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"

EDIT :

I've found a viable solution in the meantime:

lapply(y, function(x) paste0(unlist(str_extract_all(x, "<([a-z]{3})>.*?</\\1>")), collapse = ", "))

How about this?

unlist(str_extract_all(y, "\\<([A-Za-z0-9_]+\\>).*?(\\<\\/\\1)"))

# [1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>" "<dir> where is Londonderry?</dir>"                         
# [3] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>"    "<rep> I 1lIved in Lisburn </rep>"                          
# [5] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>"    "<icu> Yeah </icu>"     

Basically all we're doing here is putting the (opening) tag's body (+ its tailing angular bracket) in a capture group, and using that capture group to define the closing tag as well. Then we capture everything between those two instances of said capture group(s). So something like: <(tag>)whatever<\\1 where \1 is tag> .

Edit:

I guess this should do it:

lapply(str_extract_all(y, "\\<([A-Za-z0-9]+)\\>.*?\\<\\/\\1\\>"), paste, collapse = ", ")

# [[1]]
# [1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"

# [[2]]
# [1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>, <rep> I 1lIved in Lisburn </rep>"

# [[3]]
# [1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"
library(gsubfn)
a1 <- strapplyc(y, "<dir>(.*?)</dir>", simplify = c)
a2 <- strapplyc(y, "<rep>(.*?)</rep>", simplify = c)
a3 <- strapplyc(y, "<icu>(.*?)</icu>", simplify = c)

a1
a2
a3

# output:
> a1
[1] " where is Londonderry?"
> a2
[1] " I 1knOw 2LondondErry is bigger than 2LIsburn% " " <[> But it 's 1nOt an overflow of Belfast% "   
[3] " I 1lIved in Lisburn "                          
> a3
[1] " Yeah "

If I understood your problem corectly, this is a possible solution (I use the rebus package a lot for regex related problems - the result is a conventional regex):

library(dplyr)
library(rebus)
library(stringi)
library(purrr)

y <- c("some stuff <rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep> some stuff <#> <dir> where is Londonderry?</dir>",
       "some stuff <rep> <[> But it 's 1nOt an overflow of Belfast% </rep> <#> potentially more stuff <rep> I 1lIved in Lisburn </rep>",
       "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa> blah blub <icu> Yeah </icu>")

pattern <- "<" %R% ANY_CHAR %R% ANY_CHAR %R% ANY_CHAR %R% ">" %R% ".*?" %R% "<" %R% "/" %R% ANY_CHAR %R% ANY_CHAR %R% ANY_CHAR %R% ">" 

stringi::stri_extract_all_regex(y ,pattern, simplify = FALSE) %>% 
  purrr::map(~paste0(.x, collapse = ", "))

[[1]]
[1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"

[[2]]
[1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>, <rep> I 1lIved in Lisburn </rep>"

[[3]]
[1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM