I have text with large numbers of special characters that I want to extract certain substrings from:
y <- c("some stuff <rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep> some stuff <#> <dir> where is Londonderry?</dir>",
"some stuff <rep> <[> But it 's 1nOt an overflow of Belfast% </rep> <#> potentially more stuff <rep> I 1lIved in Lisburn </rep>",
"<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa> blah blub <icu> Yeah </icu>")
I want to extract whatever comes between tag-like substrings, such as <dir>...</dir>, <rep>...</rep>, or <icu>...</icu>.
With this regex I'm modestly successful:
library(stringr)
lapply(y, function(x) paste0(unlist(str_extract_all(x, "<([a-z]{3})>(?!<\\1>).*</\\1>")), collapse = ", "))
[[1]]
[1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"
[[2]]
[1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep> <#> potentially more stuff <rep> I 1lIved in Lisburn </rep>"
[[3]]
[1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"
Only [[2]] isn't as expected: it still contains unwanted material (namely <#> potentially more stuff), and the two <rep>...</rep> substrings are not separated by a comma. My hunch is that my regex fails here because the two tags are the same rather than different.
How can the regex be improved so that this expected result is obtained:
Expected result:
[[1]]
[1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"
[[2]]
[1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>, <rep> I 1lIved in Lisburn </rep>"
[[3]]
[1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"
EDIT :
I've found a viable solution in the meantime:
lapply(y, function(x) paste0(unlist(str_extract_all(x, "<([a-z]{3})>.*?</\\1>")), collapse = ", "))
How about this?
unlist(str_extract_all(y, "\\<([A-Za-z0-9_]+\\>).*?(\\<\\/\\1)"))
# [1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>" "<dir> where is Londonderry?</dir>"
# [3] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>" "<rep> I 1lIved in Lisburn </rep>"
# [5] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>" "<icu> Yeah </icu>"
Basically, all we're doing here is putting the opening tag's name (plus its trailing angle bracket) in a capture group and reusing that capture group to define the closing tag as well. The lazy .*? then matches everything between the opening tag and the nearest matching closing tag. So something like <(tag>)whatever</\\1, where the backreference \\1 stands for tag>.
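To make the role of the backreference and the lazy quantifier concrete, here is a minimal sketch in base R (regmatches/gregexpr with perl = TRUE rather than stringr, so it runs without extra packages). The string x is a made-up example, not the data from the question.

```r
# Hypothetical example string with two identical tags and one different tag.
x <- "a <rep> one </rep> b <rep> two </rep> c <dir> three </dir>"

# Lazy .*? stops at the first closing tag whose name equals the captured one.
lazy <- regmatches(x, gregexpr("<([a-z]{3})>.*?</\\1>", x, perl = TRUE))[[1]]

# Greedy .* runs on to the LAST closing tag with that name, so the two
# <rep> blocks are swallowed into a single match.
greedy <- regmatches(x, gregexpr("<([a-z]{3})>.*</\\1>", x, perl = TRUE))[[1]]

lazy    # "<rep> one </rep>" "<rep> two </rep>" "<dir> three </dir>"
greedy  # "<rep> one </rep> b <rep> two </rep>" "<dir> three </dir>"
```

The greedy result mirrors what happens to the second string in the question: the same tag occurs twice, so a greedy match spans from the first opening tag to the last closing one.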
Edit:
I guess this should do it:
lapply(str_extract_all(y, "\\<([A-Za-z0-9]+)\\>.*?\\<\\/\\1\\>"), paste, collapse = ", ")
# [[1]]
# [1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"
# [[2]]
# [1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>, <rep> I 1lIved in Lisburn </rep>"
# [[3]]
# [1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"
library(gsubfn)
a1 <- strapplyc(y, "<dir>(.*?)</dir>", simplify = c)
a2 <- strapplyc(y, "<rep>(.*?)</rep>", simplify = c)
a3 <- strapplyc(y, "<icu>(.*?)</icu>", simplify = c)
a1
a2
a3
# output:
> a1
[1] " where is Londonderry?"
> a2
[1] " I 1knOw 2LondondErry is bigger than 2LIsburn% " " <[> But it 's 1nOt an overflow of Belfast% "
[3] " I 1lIved in Lisburn "
> a3
[1] " Yeah "
If I understood your problem correctly, this is a possible solution (I use the rebus package a lot for regex-related problems; the result is a conventional regex):
library(dplyr)
library(rebus)
library(stringi)
library(purrr)
y <- c("some stuff <rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep> some stuff <#> <dir> where is Londonderry?</dir>",
"some stuff <rep> <[> But it 's 1nOt an overflow of Belfast% </rep> <#> potentially more stuff <rep> I 1lIved in Lisburn </rep>",
"<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa> blah blub <icu> Yeah </icu>")
pattern <- "<" %R% ANY_CHAR %R% ANY_CHAR %R% ANY_CHAR %R% ">" %R% ".*?" %R% "<" %R% "/" %R% ANY_CHAR %R% ANY_CHAR %R% ANY_CHAR %R% ">"
stringi::stri_extract_all_regex(y ,pattern, simplify = FALSE) %>%
purrr::map(~paste0(.x, collapse = ", "))
[[1]]
[1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"
[[2]]
[1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>, <rep> I 1lIved in Lisburn </rep>"
[[3]]
[1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"
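One caveat worth checking: a pattern built from three arbitrary characters on each side does not tie the closing tag to the opening one. The sketch below uses base R and a made-up string x, and it assumes that the rebus pattern above corresponds roughly to the plain regex <...>.*?</...> (that is, that ANY_CHAR is the regex dot).

```r
# Hypothetical string whose opening and closing tags do NOT match.
x <- "<abc> some text </xyz>"

# Any three characters on each side: the mismatched pair is still accepted.
any_tag <- regmatches(x, gregexpr("<...>.*?</...>", x, perl = TRUE))[[1]]

# Capturing the tag name and reusing it via \\1 requires both names to agree.
same_tag <- regmatches(x, gregexpr("<(...)>.*?</\\1>", x, perl = TRUE))[[1]]

any_tag   # "<abc> some text </xyz>"
same_tag  # character(0)
```

For the three strings in the question this makes no difference, but a backreference may be the safer choice if mismatched tags can occur in the data.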