gsub, lookahead and lookbehind

I have a string vector containing:

Number of source1.2_SPNB.txt
Number of source1.1_SPNB.txt
Number of source1.3_SPNB.txt

I need to extract "source1.1", "source1.2" and "source1.3" in a new vector.

Following this, I tried:

gsub("(?<=of )(.*)(?=_)", "\\1", string.vector)

But I get an error:

invalid regular expression '(?<=of )(.*)(?=_)', reason 'Invalid regexp'

I then tried:

gsub("(?<=of )(.*)(?=_)", "\\1", string.vector, perl = TRUE)

But it returned the exact same string vector.

What am I doing wrong?

There are several problems:

  • perl = TRUE is needed to use lookahead/lookbehind; the default regex engine does not support them.

  • Even with perl = TRUE, the regular expression only replaces the desired substring with itself, because the zero-width lookbehind and lookahead are not part of the match. What we want instead is to match the entire string and then replace it with just the portion matched by the capture group.

  • Presumably only one substitution per string is required, so sub, not gsub, should be used.

Fixing these problems we get:

sub(".*(source.*?)_.*", "\\1", string.vector)
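To make the second point concrete, the following sketch contrasts the two calls on the sample data from the question. The variable names unchanged and extracted are only illustrative.

```r
string.vector <- c("Number of source1.2_SPNB.txt",
                   "Number of source1.1_SPNB.txt",
                   "Number of source1.3_SPNB.txt")

# The lookarounds are zero-width, so only the text between "of " and "_"
# is matched, and "\\1" puts that same text back: the result is unchanged.
unchanged <- gsub("(?<=of )(.*)(?=_)", "\\1", string.vector, perl = TRUE)
identical(unchanged, string.vector)
#[1] TRUE

# Here the pattern consumes the whole string, so replacing the match with
# the capture group leaves only the captured part.
extracted <- sub(".*(source.*?)_.*", "\\1", string.vector)
extracted
#[1] "source1.2" "source1.1" "source1.3"
```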

We could match characters up to the last space ( .*\\s ) or ( | ) an underscore followed by any other characters ( _.* ) and replace the match with an empty string ( "" ):

gsub(".*\\s|_.*", "", string.vector)
#[1] "source1.2" "source1.1" "source1.3"
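The alternation may be easier to follow when split into its two halves. This is only an illustrative sketch: the intermediate names no_prefix and no_suffix are not part of the original answer, and it assumes the wanted token contains neither a space nor an underscore.

```r
string.vector <- c("Number of source1.2_SPNB.txt",
                   "Number of source1.1_SPNB.txt",
                   "Number of source1.3_SPNB.txt")

# First alternative: ".*\\s" is greedy, so it removes everything up to and
# including the last whitespace character.
no_prefix <- sub(".*\\s", "", string.vector)
no_prefix
#[1] "source1.2_SPNB.txt" "source1.1_SPNB.txt" "source1.3_SPNB.txt"

# Second alternative: "_.*" removes the first underscore and everything after it.
no_suffix <- sub("_.*", "", no_prefix)
no_suffix
#[1] "source1.2" "source1.1" "source1.3"
```

Because gsub replaces every match, the single call with | performs both removals in one pass.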

Or, if we want to use a capture group:

sub(".*\\sof\\s([^_]+).*", "\\1", string.vector)
#[1] "source1.2" "source1.1" "source1.3"

For extraction purposes, it may be better to use str_extract from stringr or regmatches/regexpr from base R:

regmatches(string.vector, regexpr("(?<=of )([^_]+)(?=_)", string.vector, perl = TRUE))
#[1] "source1.2" "source1.1" "source1.3"
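For comparison, a stringr version might look like the sketch below. It assumes the stringr package is installed, and the extra element "no match here" is a made-up input added only to show how the two approaches differ when a string does not match.

```r
library(stringr)  # assumption: stringr is installed

string.vector <- c("Number of source1.2_SPNB.txt",
                   "Number of source1.1_SPNB.txt",
                   "Number of source1.3_SPNB.txt")
pattern <- "(?<=of )[^_]+(?=_)"

# stringr's regex engine supports lookarounds without any extra argument.
str_extract(string.vector, pattern)
#[1] "source1.2" "source1.1" "source1.3"

# Hypothetical input with one non-matching element.
with_miss <- c(string.vector, "no match here")

# str_extract keeps the input length and returns NA for the non-match.
from_stringr <- str_extract(with_miss, pattern)
length(from_stringr)
#[1] 4

# regmatches with regexpr silently drops elements that did not match.
from_base <- regmatches(with_miss, regexpr(pattern, with_miss, perl = TRUE))
length(from_base)
#[1] 3
```

If the result has to line up with the input vector, for example as a new column in a data frame, the NA-preserving behaviour of str_extract is usually the safer choice.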

data

string.vector <- c("Number of source1.2_SPNB.txt", "Number of source1.1_SPNB.txt", 
             "Number of source1.3_SPNB.txt")
