Using both capturing and alternation parentheses in a regex with gsub in R

I want to eliminate some text in order to capture a key in a larger string, like:

text <- c("BLASTBLA ST-ASBA/aGABN/1234/BLA BLABLABLABLABLA", "BLA BSTLA ST ASBA/aGABN/1234/BLA BLABLASTBLABLA",
          "BLABLSTAAAaa ST  ASBA/aGABN/1234/BLA BLABLABLSTABLA")

desired result:

c("ST-ASBA/aGABN/1234/BLA BLABLABLABLABLA", "ST ASBA/aGABN/1234/BLA BLABLABLABLA",
          "ST  ASBA/aGABN/1234/BLA BLABLABLABLA")

I'm trying two uses of parentheses here: capture and selection (alternation). The first is meant to eliminate the undesired part of the text, and the second to make the pattern that recognizes the "ST" part more flexible. This is what I'm trying:

gsub("(.*\\s+)ST(\\-|\\s+).*", "", text)
[1] "" "" ""

How could I remove the first part of the text?

We can use a capture group and refer to it in the replacement with a backreference. Here we match any characters ( .* ) up to a word boundary, then capture the substring that starts with 'ST' followed by a hyphen, tab, or space, together with the rest of the string ( .* ). In the replacement, we specify the backreference to that group ( \\1 ).

sub(".*\\b(ST[-\\t ].*)", "\\1", text)
#[1] "ST-ASBA/aGABN/1234/BLA BLABLABLABLABLA" "ST ASBA/aGABN/1234/BLA BLABLABLABLA"  
#[3] "ST  ASBA/aGABN/1234/BLA BLABLABLABLA"  
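
One property of this pattern may be worth checking: the leading .* is greedy, so if 'ST' plus a separator occurs more than once, the match appears to keep only the text from the last occurrence. A minimal sketch, using a made-up string x (not from the question) and perl = TRUE (an assumption made here so that \\t inside the bracket is read as a tab):

```r
# Made-up input with two candidate "ST" keys (hypothetical, for illustration only)
x <- "aa ST-bb ST cc"

# Greedy .* backtracks only as far as the LAST "ST" followed by a separator
greedy <- sub(".*\\b(ST[-\\t ].*)", "\\1", x, perl = TRUE)

# Lazy .*? stops at the FIRST "ST" followed by a separator
lazy <- sub(".*?\\b(ST[-\\t ].*)", "\\1", x, perl = TRUE)

print(greedy)  # "ST cc"
print(lazy)    # "ST-bb ST cc"
```

If the key can appear only once per string, as in the sample data, both forms should give the same result.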

Your regex pattern matches the whole string, but your replacement is an empty string. gsub replaces all the matches, so it replaces the whole string with the empty string.
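
A small sketch of that behaviour, using a short hypothetical vector x rather than the data from the question: when the pattern spans the entire string, an empty replacement leaves nothing, while capturing the part to keep and replacing with \\1 retains it.

```r
# Hypothetical sample input, chosen only for illustration
x <- c("foo ST-abc", "bar baz ST def")

# The pattern matches the whole string, so replacing the match with "" leaves nothing
emptied <- gsub("(.*\\s+)ST(\\-|\\s+).*", "", x)

# Moving the capture group around the part to keep and using \\1 retains it
kept <- gsub(".*\\s+(ST(-|\\s+).*)", "\\1", x)

print(emptied)  # "" ""
print(kept)     # "ST-abc" "ST def"
```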

You may use

sub(".*?\\s+(ST[[:space:]-])", "\\1", text)
sub(".*?\\s+(ST[\\s-])", "\\1", text, perl=TRUE)
sub(".*?(?=\\s+(?:ST[\\s-]))", "", text, perl=TRUE)

Details

  • .*?\\s+(ST[[:space:]-]) - a TRE regex matching
    • .*? - any 0+ chars, as few as possible
    • \\s+ - 1+ whitespace chars
    • (ST[[:space:]-]) - Group 1 (later referred to with \\1 ): the ST substring followed by a whitespace or - char
  • .*?\\s+(ST[\\s-]) - a PCRE regex matching
    • .*? - any 0+ chars other than line break chars, as few as possible
    • \\s+ - 1+ whitespace chars
    • (ST[\\s-]) - Group 1 (later referred to with \\1 ): the ST substring followed by a whitespace or - char
  • .*?(?=\\s+(?:ST[\\s-])) - a lookahead-based PCRE pattern where (?=...) is a positive lookahead that checks for its pattern match immediately to the right of the current position but does not consume the text. It stops at the same ST as the pattern above but needs no backreference in the replacement pattern, as the text matched by the lookahead is not part of the match. Because the whitespace before ST sits inside the lookahead, it is not removed either and stays at the start of the result.
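
The three variants can be compared side by side. This is only a sketch on a hypothetical vector x (not the data from the question); look2 is a variation introduced here, not part of the answer, that moves \\s+ out of the lookahead so the whitespace is consumed as well.

```r
# Hypothetical sample input, not taken from the question
x <- c("foo ST-abc/1", "bar baz ST  def/2")

# Default (TRE) engine with a backreference
tre <- sub(".*?\\s+(ST[[:space:]-])", "\\1", x)

# PCRE engine with a backreference
pcre <- sub(".*?\\s+(ST[\\s-])", "\\1", x, perl = TRUE)

# PCRE lookahead: the whitespace before ST is checked but not consumed
look <- sub(".*?(?=\\s+(?:ST[\\s-]))", "", x, perl = TRUE)

# Variation (assumption of this sketch): consume the whitespace, look ahead only for ST
look2 <- sub(".*?\\s+(?=ST[\\s-])", "", x, perl = TRUE)

print(tre)    # "ST-abc/1"  "ST  def/2"
print(pcre)   # "ST-abc/1"  "ST  def/2"
print(look)   # " ST-abc/1" " ST  def/2"
print(look2)  # "ST-abc/1"  "ST  def/2"
```

If the leading whitespace left by the lookahead version is unwanted, wrapping the call in trimws() would be another way to drop it.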
