gsub, lookahead and lookbehind

Question

I have a string vector containing:

Number of source1.2_SPNB.txt
Number of source1.1_SPNB.txt
Number of source1.3_SPNB.txt

I need to extract "source1.1", "source1.2" and "source1.3" into a new vector.

Following this, I tried:

gsub("(?<=of )(.*)(?=_)", "\\1", string.vector)

But I get an error:

invalid regular expression '(?<=of )(.*)(?=_)', reason 'Invalid regexp'

I then tried:

gsub("(?<=of )(.*)(?=_)", "\\1", string.vector, perl = TRUE)

But it returned the exact same string vector.

What am I doing wrong?

Answer 1

There are several problems:

perl = TRUE is needed to use lookahead/lookbehind.
Even with perl = TRUE, the regular expression just replaces the desired substring with itself, because lookahead/lookbehind are zero width and only the text between them is matched. What we want is to match the entire string and then replace it with just the portion matching the capture group.
Presumably only one substitution is required per string, so sub, not gsub, should be used.

Fixing these problems we get:

sub(".*(source.*?)_.*", "\\1", string.vector)
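
To make the second point concrete, here is a minimal sketch comparing the two calls. It assumes the three example strings from the question; the names unchanged and extracted are only illustrative.

```r
string.vector <- c("Number of source1.2_SPNB.txt",
                   "Number of source1.1_SPNB.txt",
                   "Number of source1.3_SPNB.txt")

# Lookarounds are zero width, so only the text between them is matched.
# Replacing that match with capture group 1 puts the same text back.
unchanged <- gsub("(?<=of )(.*)(?=_)", "\\1", string.vector, perl = TRUE)
identical(unchanged, string.vector)

# Here the pattern consumes the whole string, so replacing the match with
# capture group 1 leaves only the wanted part.
extracted <- sub(".*(source.*?)_.*", "\\1", string.vector)
extracted
```

The first call is a no-op because the prefix and suffix are never part of the match, so nothing removes them.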

Answer 2

We could match characters up to the last space (.*\\s) or (|) a _ followed by any other characters (_.*) and replace the match with an empty string ("").

gsub(".*\\s|_.*", "", string.vector)
#[1] "source1.2" "source1.1" "source1.3"

Or, if we need capture groups:

sub(".*\\sof\\s([^_]+).*", "\\1", string.vector)
#[1] "source1.2" "source1.1" "source1.3"

For extraction purposes, it may be better to use str_extract from stringr or regmatches/regexpr from base R:

regmatches(string.vector, regexpr("(?<=of )([^_]+)(?=_)", string.vector, perl = TRUE))
#[1] "source1.2" "source1.1" "source1.3"
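
The stringr route mentioned above is not shown, so here is a minimal sketch of it. It assumes the stringr package is installed and reuses the lookaround pattern from the regmatches call; the name extracted is only illustrative.

```r
library(stringr)  # assumption: stringr is installed

string.vector <- c("Number of source1.2_SPNB.txt",
                   "Number of source1.1_SPNB.txt",
                   "Number of source1.3_SPNB.txt")

# str_extract returns the first match in each string (NA where nothing matches).
# The lookbehind and lookahead keep "of " and "_" out of the returned text.
extracted <- str_extract(string.vector, "(?<=of )[^_]+(?=_)")
extracted
```

Unlike regmatches, which drops non-matching elements, str_extract keeps the result the same length as the input.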

string.vector <- c("Number of source1.2_SPNB.txt", "Number of source1.1_SPNB.txt",
                   "Number of source1.3_SPNB.txt")