
Capture after positive lookbehind except if string contains exclusions in R

Suppose I have the following strings to parse for use in R:

PIA+1+TC_5504_00_312010_0050+50103 AB346 type 5334
PIA+1+TC_3444_00_312010_0133+0140
PIA+1+DRW/50488665600/01/000:DW
PIA+1+TC_5635_00_312019_2644+LoremIpsum
PIA+1+TC_5635_00_312010_0040+63503 AB346 type 5334
PIA+1+TC_5635_00_312018_0223+DolorSit
PIA+1+TC_5635_00_312019_2644+DolorSit

In a nutshell, the logic is this:

  1. Capture everything after the positive lookbehind (?<=^PIA\\+\\d{1}\\+)
  2. Do not capture where the string contains either LoremIpsum or DolorSit

The desired output should look like this:

TC_5504_00_312010_0050+50103 AB346 type 5334
TC_3444_00_312010_0133+0140
DRW/50488665600/01/000:DW
TC_5635_00_312010_0040+63503 AB346 type 5334

This is where I got to thus far:

(?<=^PIA\+\d{1}\+)((?!.*LoremIpsum|DolorSit).)*$

I have deliberately not escaped the backslashes for R yet, but will do so later.

But unfortunately, it puts the last character of the match into a separate capture group. My regex experience is limited, and I am stuck on how to capture the entire remaining string in one group. I am not bound to this approach. Another option would be to exclude the string whenever the last + is followed by a letter, but that would make it difficult to handle strings such as PIA+1+DRW/50488665600/01/000:DW.

My samples are stored on regex101.com.

You can drop the inefficient tempered greedy token altogether: replace it with a single negative lookahead that contains an alternation group, followed by .* to match the rest of the line:

(?<=^PIA\+\d\+)(?!.*(?:LoremIpsum|DolorSit)).*$

See the regex demo
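
As a rough illustration of why the original attempt leaves a single character in group 1: a quantified capturing group keeps only the text of its last iteration. The sketch below uses base R's regexec() with perl = TRUE on one of the sample strings; the variable names (original, fixed, m_original, m_fixed) are chosen for illustration only.

```r
# One of the sample strings from the question
x <- "PIA+1+TC_3444_00_312010_0133+0140"

# Original attempt: the capturing group is repeated once per character
original <- "(?<=^PIA\\+\\d{1}\\+)((?!.*LoremIpsum|DolorSit).)*$"
# Suggested pattern: one lookahead check, then .* takes the rest of the line
fixed <- "(?<=^PIA\\+\\d\\+)(?!.*(?:LoremIpsum|DolorSit)).*$"

# regexec() returns the whole match followed by any capture groups
m_original <- regmatches(x, regexec(original, x, perl = TRUE))[[1]]
m_fixed <- regmatches(x, regexec(fixed, x, perl = TRUE))[[1]]

print(m_original)  # whole match, then group 1 holding only the last character
print(m_fixed)     # whole match only, since the pattern has no capturing group
```

Both patterns produce the same whole match for this string; the difference is only in what ends up in the capture group.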

Details

  • (?<= - a positive lookbehind that, immediately to the left of the current location, requires
    • ^ - start of string
    • PIA\\+ - PIA+
    • \\d - one digit
    • \\+ - a +
  • ) - end of the lookbehind
  • (?!.*(?:LoremIpsum|DolorSit)) - a negative lookahead that fails the match if, immediately to the right of the current location, there are zero or more chars other than line break chars (as many as possible) followed by either LoremIpsum or DolorSit
  • .*$ - rest of the line.

See an R demo:

library(stringr)
x <- c("PIA+1+TC_5504_00_312010_0050+50103 AB346 type 5334",
"PIA+1+TC_3444_00_312010_0133+0140",
"PIA+1+DRW/50488665600/01/000:DW",
"PIA+1+TC_5635_00_312019_2644+LoremIpsum",
"PIA+1+TC_5635_00_312010_0040+63503 AB346 type 5334",
"PIA+1+TC_5635_00_312018_0223+DolorSit",
"PIA+1+TC_5635_00_312019_2644+DolorSit")
str_extract(x, "(?<=^PIA\\+\\d\\+)(?!.*(?:LoremIpsum|DolorSit)).*$")

Output:

[1] "TC_5504_00_312010_0050+50103 AB346 type 5334"
[2] "TC_3444_00_312010_0133+0140"                 
[3] "DRW/50488665600/01/000:DW"                   
[4] NA                                            
[5] "TC_5635_00_312010_0040+63503 AB346 type 5334"
[6] NA                                            
[7] NA  
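
str_extract() returns NA for the excluded strings, whereas the desired output in the question lists only the four matches. A minimal base R sketch of one way to get exactly those, assuming the same sample vector x (repeated here so the snippet is self-contained): regmatches() drops elements without a match when it is given regexpr() match data.

```r
x <- c("PIA+1+TC_5504_00_312010_0050+50103 AB346 type 5334",
       "PIA+1+TC_3444_00_312010_0133+0140",
       "PIA+1+DRW/50488665600/01/000:DW",
       "PIA+1+TC_5635_00_312019_2644+LoremIpsum",
       "PIA+1+TC_5635_00_312010_0040+63503 AB346 type 5334",
       "PIA+1+TC_5635_00_312018_0223+DolorSit",
       "PIA+1+TC_5635_00_312019_2644+DolorSit")

pattern <- "(?<=^PIA\\+\\d\\+)(?!.*(?:LoremIpsum|DolorSit)).*$"

# regexpr() finds the first match per element (perl = TRUE is needed for lookarounds);
# regmatches() then keeps only the elements that actually matched
res <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(res)
```

With the stringr result, filtering with !is.na() should give the same vector.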

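Or, in base R, use sub() to keep only the captured part (the |.* branch turns every non-matching string into an empty string) and then drop the empty results: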

res <- sub("^PIA\\+\\d\\+(?!.*(?:LoremIpsum|DolorSit))(.*)|.*", "\\1", x, perl=TRUE)
res[res != ""]

See this R demo.

Or, if you just need the full strings that pass the check (prefix included), use grep:

grep("^PIA\\+\\d\\+(?!.*(?:LoremIpsum|DolorSit))", x, perl=TRUE, value=TRUE)

See this R demo. Output:

[1] "PIA+1+TC_5504_00_312010_0050+50103 AB346 type 5334"
[2] "PIA+1+TC_3444_00_312010_0133+0140"                 
[3] "PIA+1+DRW/50488665600/01/000:DW"                   
[4] "PIA+1+TC_5635_00_312010_0040+63503 AB346 type 5334"
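
If the list of exclusions is likely to grow, the alternation can be assembled from a character vector instead of being typed into the pattern. This is only a sketch: the names exclusions, pattern, kept and res are hypothetical, and it assumes the exclusion words contain no regex metacharacters (otherwise they would need escaping first).

```r
x <- c("PIA+1+TC_5504_00_312010_0050+50103 AB346 type 5334",
       "PIA+1+TC_3444_00_312010_0133+0140",
       "PIA+1+DRW/50488665600/01/000:DW",
       "PIA+1+TC_5635_00_312019_2644+LoremIpsum",
       "PIA+1+TC_5635_00_312010_0040+63503 AB346 type 5334",
       "PIA+1+TC_5635_00_312018_0223+DolorSit",
       "PIA+1+TC_5635_00_312019_2644+DolorSit")

# Hypothetical exclusion list; assumed to hold plain words without regex metacharacters
exclusions <- c("LoremIpsum", "DolorSit")

# Build the same negative lookahead as above from the vector
pattern <- paste0("^PIA\\+\\d\\+(?!.*(?:", paste(exclusions, collapse = "|"), "))")

# Keep the full strings that pass the check, then strip the PIA+<digit>+ prefix
kept <- grep(pattern, x, perl = TRUE, value = TRUE)
res <- sub("^PIA\\+\\d\\+", "", kept, perl = TRUE)
print(res)
```

Filtering first and stripping the prefix second keeps each step simple, at the cost of scanning the kept strings twice.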


 