I have a string vector containing:
Number of source1.2_SPNB.txt
Number of source1.1_SPNB.txt
Number of source1.3_SPNB.txt
I need to extract "source1.1", "source1.2" and "source1.3" in a new vector.
Following this, I tried:
gsub("(?<=of )(.*)(?=_)", "\\1", string.vector)
But I get an error:
invalid regular expression '(?<=of )(.*)(?=_)', reason 'Invalid regexp'
I then tried:
gsub("(?<=of )(.*)(?=_)", "\\1", string.vector, perl = TRUE)
But it returned the exact same string vector.
What am I doing wrong?
There are several problems:
perl = TRUE is needed to use lookahead/lookbehind
even with perl = TRUE, the regular expression only replaces the desired substring with itself. What we want instead is to match the entire string (rather than using zero-width lookahead/lookbehind) and replace it with just the portion matched by the capture group.
presumably only one substitution per string is required, so sub, not gsub, should be used
Fixing these problems we get:
sub(".*(source.*?)_.*", "\\1", string.vector)
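As a minimal sketch of the second point, the snippet below rebuilds the vector from the question and compares the two calls. The names unchanged and extracted are only illustrative and do not appear in the original posts.

```r
string.vector <- c("Number of source1.2_SPNB.txt",
                   "Number of source1.1_SPNB.txt",
                   "Number of source1.3_SPNB.txt")

# The lookarounds are zero-width, so only "source1.x" is matched and then
# replaced by capture group 1, which is that same text: nothing changes.
unchanged <- gsub("(?<=of )(.*)(?=_)", "\\1", string.vector, perl = TRUE)
identical(unchanged, string.vector)
#[1] TRUE

# Matching the whole string and keeping only the capture group extracts it.
extracted <- sub(".*(source.*?)_.*", "\\1", string.vector)
extracted
#[1] "source1.2" "source1.1" "source1.3"
```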
We could match characters up to the last space (.*\\s) or (|) a _ followed by any other characters (_.*) and replace the matches with an empty string (""):
gsub(".*\\s|_.*", "", string.vector)
#[1] "source1.2" "source1.1" "source1.3"
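This pattern needs gsub rather than sub because each string contains two separate matches, the prefix and the suffix. A small sketch that applies the two alternatives one at a time may make that clearer. The names x, no_prefix, both and once are hypothetical, chosen only for illustration.

```r
x <- "Number of source1.2_SPNB.txt"

# First alternative: the greedy .* runs up to the last whitespace character.
no_prefix <- sub(".*\\s", "", x)
no_prefix
#[1] "source1.2_SPNB.txt"

# Second alternative: drop everything from the underscore onwards.
both <- sub("_.*", "", no_prefix)
both
#[1] "source1.2"

# sub() with the combined pattern stops after the first match,
# so only the prefix is removed and the suffix stays.
once <- sub(".*\\s|_.*", "", x)
once
#[1] "source1.2_SPNB.txt"
```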
Or, if we want to use a capture group:
sub(".*\\sof\\s([^_]+).*", "\\1", string.vector)
#[1] "source1.2" "source1.1" "source1.3"
For extraction, it may be better to use str_extract from stringr or regmatches/regexpr from base R:
regmatches(string.vector, regexpr("(?<=of )([^_]+)(?=_)", string.vector, perl = TRUE))
#[1] "source1.2" "source1.1" "source1.3"
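One practical difference between the two approaches shows up when an element does not match. The sketch below adds a made-up element, "unrelated.txt", which is not part of the original data. The stringr call is guarded because it assumes the package is installed.

```r
x <- c("Number of source1.2_SPNB.txt",
       "Number of source1.1_SPNB.txt",
       "Number of source1.3_SPNB.txt",
       "unrelated.txt")  # hypothetical non-matching element
pattern <- "(?<=of )[^_]+(?=_)"

# Base R: regmatches() with regexpr() drops elements that do not match,
# so the result can be shorter than the input.
base_result <- regmatches(x, regexpr(pattern, x, perl = TRUE))
length(base_result)
#[1] 3

# stringr (only if installed): str_extract() keeps the input length
# and returns NA where there is no match.
if (requireNamespace("stringr", quietly = TRUE)) {
  stringr_result <- stringr::str_extract(x, pattern)
  print(stringr_result)
}
```

If the result has to line up with the input, for example as a new data frame column, the NA-preserving behaviour is usually the easier one to work with.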
The data used above:
string.vector <- c("Number of source1.2_SPNB.txt", "Number of source1.1_SPNB.txt",
                   "Number of source1.3_SPNB.txt")