
Regular expression in R - extract only match

My strings look as follows:

crb_gdp_g_100000_16_16_ftv_all.txt
crb_gdp_g_100000_16_20_fweo2_all.txt
crb_gdp_g_100000_4_40_fweo2_galt_1.txt

I only want to extract the part between f and the following underscore (in these three cases "tv", "weo2" and "weo2").

My regular expression is:

regex.f = "_f([[:alnum:]]+)_"

There is no string with more than one part matching the pattern. Why does the following command not work?

sub(regex.f, "\\1", "crb_gdp_g_100000_16_16_ftv_all.txt")

The command only removes "_f" from the string and returns the remaining string.

This can easily be achieved with qdapRegex:

df <- c("crb_gdp_g_100000_16_16_ftv_all.txt", 
"crb_gdp_g_100000_16_20_fweo2_all.txt", 
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")

library(qdapRegex)
rm_between(df, "_f", "_", extract=TRUE)

We can use sub to extract the strings by matching everything up to _f, followed by one or more characters that are not an underscore or a digit ([^_0-9]+), captured as a group ((...)), followed by zero or more digits (\\d*), then an _ and any remaining characters. The whole string is replaced with the backreference (\\1) to the captured group. Note that this drops trailing digits, so "fweo2" yields "weo" rather than "weo2".

sub(".*_f([^_0-9]+)\\d*_.*", "\\1", str1)
#[1] "tv"  "weo" "weo"

data

str1 <- c("crb_gdp_g_100000_16_16_ftv_all.txt", 
    "crb_gdp_g_100000_16_20_fweo2_all.xml",
     "crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
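If the trailing digits should be kept (so that "fweo2" gives "weo2"), one possible variant is to capture everything up to the next underscore with [^_]+. The sketch below also guards against a common gotcha of sub: a string that does not match the pattern is returned unchanged. The helper name extract_f and the NA fallback are illustrative assumptions, not part of the answer above.

```r
# Hypothetical helper: extract the token between "_f" and the next "_".
extract_f <- function(x) {
  pattern <- ".*_f([^_]+)_.*"
  # sub() returns its input unchanged when nothing matches,
  # so non-matching strings are mapped to NA explicitly.
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

str1 <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
          "crb_gdp_g_100000_16_20_fweo2_all.xml",
          "crb_gdp_g_100000_4_40_fweo2_galt_1.txt")

res <- extract_f(str1)
res
#> [1] "tv"   "weo2" "weo2"

# A string without "_f..._" yields NA instead of the original string.
no_match <- extract_f("no_match_here.txt")
no_match
#> [1] NA
```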

My usual regex for extracting the text between two characters comes from https://stackoverflow.com/a/13499594/1017276, which specifically looks at extracting text between parentheses. This approach only changes the parentheses to _f and _.

x <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
       "crb_gdp_g_100000_16_20_fweo2_all.xml",
       "crb_gdp_g_100000_4_40_fweo2_galt_1.txt",
       "crb_gdp_g_100000_20_tbf_16_nqa_8_flin_galt_2.xml")

regmatches(x,gregexpr("(?<=_f).*?(?=_)", x, perl=TRUE))
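Because gregexpr finds all matches, the call above returns a list with one element per string. Since each string here contains at most one match, a flat character vector may be more convenient. A minimal sketch using regexpr instead, with [^_]+ swapped in for .*? (equivalent for these inputs); the variable names m, res and out are illustrative:

```r
x <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
       "crb_gdp_g_100000_16_20_fweo2_all.xml",
       "crb_gdp_g_100000_4_40_fweo2_galt_1.txt",
       "crb_gdp_g_100000_20_tbf_16_nqa_8_flin_galt_2.xml")

# regexpr() reports only the first match per string;
# perl = TRUE is needed for the lookbehind/lookahead.
m <- regexpr("(?<=_f)[^_]+(?=_)", x, perl = TRUE)
res <- regmatches(x, m)
res
#> [1] "tv"   "weo2" "weo2" "lin"

# regmatches() drops elements without a match, so to stay aligned
# with x, fill a vector of NAs at the positions that did match.
out <- rep(NA_character_, length(x))
out[m != -1] <- regmatches(x, m)
```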

Or with the stringr package.

library(stringr)

str_extract(x, "(?<=_f).*?(?=_)")

Edited to start the match on _f instead of f.

NOTE

akrun's answer (using sub) runs a few milliseconds faster than the stringr approach, and about ten times faster than the base regmatches approach. The base regmatches approach clocks in at about 100 milliseconds for a character vector of 10,000 elements.
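A comparison of this kind can be checked roughly with system.time from base R. The sketch below is only an assumption about how such a benchmark might be set up: the input is built by repeating the example strings up to 10,000 elements, and the pattern [^_]+ is used in both calls so that they return the same values. Absolute timings depend on the machine and the R version.

```r
x <- rep(c("crb_gdp_g_100000_16_16_ftv_all.txt",
           "crb_gdp_g_100000_16_20_fweo2_all.xml",
           "crb_gdp_g_100000_4_40_fweo2_galt_1.txt"),
         length.out = 10000)

# sub() with a capture group
t_sub <- system.time(
  r1 <- sub(".*_f([^_]+)_.*", "\\1", x)
)

# regmatches() + gregexpr() with lookarounds
t_reg <- system.time(
  r2 <- unlist(regmatches(x, gregexpr("(?<=_f)[^_]+(?=_)", x, perl = TRUE)))
)

# Both approaches should agree before their timings are compared.
same <- identical(r1, r2)

t_sub["elapsed"]
t_reg["elapsed"]
```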

Update: capture the match using str_match

library(stringr)  
m <- str_match("crb_gdp_g_100000_16_20_fweo2_all.txt", "_f([[:alnum:]]+)_")
# Column 1 holds the full match, column 2 the first capture group
print(m[, 2])
# [1] "weo2"
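Roughly the same capture-group extraction is possible in base R with regexec, which keeps the original pattern unchanged. A minimal sketch (the variable names s, m and groups are illustrative):

```r
s <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
       "crb_gdp_g_100000_16_20_fweo2_all.txt",
       "crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
regex.f <- "_f([[:alnum:]]+)_"

# regexec() records the positions of the full match and of each capture
# group; regmatches() turns them into one character vector per string.
m <- regmatches(s, regexec(regex.f, s))

# Element 1 is the full match (e.g. "_ftv_"), element 2 the capture group.
# A string without a match gives character(0), so g[2] is NA.
groups <- vapply(m, function(g) g[2], character(1))
groups
#> [1] "tv"   "weo2" "weo2"
```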

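Your regex does not work because sub replaces only the part of the string that matched, so the pattern needs a leading and a trailing .* to consume the rest of the string. You can also use \\w as a shorthand close to [[:alnum:]]; note that \\w also matches the underscore, hence the lazy quantifier +?.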

sub(".*_f(\\w+?)_.*", "\\1", "crb_gdp_g_100000_16_20_fweo2_all.txt")
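To make the difference concrete, the two patterns can be run side by side on one of the strings from the question. In this sketch the variable names partial and full are illustrative:

```r
s <- "crb_gdp_g_100000_16_16_ftv_all.txt"

# Original pattern: only "_ftv_" is matched, so only that piece is
# replaced by the capture group "tv"; the rest of the string stays.
partial <- sub("_f([[:alnum:]]+)_", "\\1", s)
partial
#> [1] "crb_gdp_g_100000_16_16tvall.txt"

# With .* on both sides the match spans the whole string, so the
# replacement leaves only the capture group.
full <- sub(".*_f([[:alnum:]]+)_.*", "\\1", s)
full
#> [1] "tv"
```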

We could use the package unglue :

library(unglue)
txt <- c("crb_gdp_g_100000_16_16_ftv_all.txt", 
       "crb_gdp_g_100000_16_20_fweo2_all.txt", 
       "crb_gdp_g_100000_4_40_fweo2_galt_1.txt")

pattern <-
  "crb_gdp_g_100000_{=\\d+}_{=\\d+}_f{x}_{=.+?}.txt"
unglue_vec(txt,pattern)
#> [1] "tv"   "weo2" "weo2"

Created on 2019-10-09 by the reprex package (v0.3.0)
