
How to subset a string in R

Dear all, I have a vector of strings like:

LOCAT01PE
WECAT013EJD
AFECAT0155DR

I want to extract from each value only CAT and all the digits that follow it:

CAT01
CAT013
CAT0155

I have tried substr, but it won't work because neither the number of characters before CAT nor the number of digits after it is fixed.

In base R, we can use sub to extract "CAT" followed by numbers.

x <- c('LOCAT01PE', 'WECAT013EJD', 'AFECAT0155DR')
# '.*' (rather than '..*') also matches strings that start with CAT
sub('.*(CAT\\d+).*', '\\1', x)
#[1] "CAT01"   "CAT013"  "CAT0155"
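
One caveat with sub is that it returns the input unchanged when the pattern does not match. A minimal sketch of guarding against that with grepl follows; the value 'NOMATCH' and the names pattern, res_sub and res_safe are made up for illustration and are not part of the original question.

```r
# 'NOMATCH' is a hypothetical value added to show the no-match case
x <- c('LOCAT01PE', 'WECAT013EJD', 'AFECAT0155DR', 'NOMATCH')
pattern <- 'CAT\\d+'

# sub() leaves strings without a match untouched
res_sub <- sub(paste0('.*(', pattern, ').*'), '\\1', x)

# Replace those untouched strings with NA by testing for a match first
res_safe <- ifelse(grepl(pattern, x), res_sub, NA_character_)
res_safe
```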

Or similarly with str_extract from the stringr package:

stringr::str_extract(x, "CAT\\d+")

We can also use substr with regexpr to identify relevant start/stop points in the string:

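vec <- c('LOCAT01PE', 'WECAT013EJD', 'AFECAT0155DR')
# start: position of 'CAT'; stop: position of the first digit followed by a letter
substr(vec,
       start = regexpr('CAT', vec),
       stop = regexpr('\\d[a-zA-Z]', vec)
       )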

Output:

[1] "CAT01"   "CAT013"  "CAT0155"
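
The stop position above relies on the digits being followed by a letter. A possible variant, sketched here as an alternative rather than as part of the answer above, takes both ends from a single regexpr call by using its match.length attribute; the names m, start, len and res are illustrative.

```r
vec <- c('LOCAT01PE', 'WECAT013EJD', 'AFECAT0155DR')

# regexpr() returns the start of each match and stores the match
# lengths in the "match.length" attribute
m <- regexpr('CAT\\d+', vec)
start <- as.integer(m)
len <- attr(m, 'match.length')

# The match ends at start + length - 1
res <- substr(vec, start = start, stop = start + len - 1)
res
```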

We can use regexpr/regmatches in base R. The pattern matches the word 'CAT', followed by an optional hyphen (-?) and one or more digits (\\d+).

regmatches(x, regexpr("CAT-?\\d+", x))
#[1] "CAT01"    "CAT013"   "CAT0155"  "CAT-01"   "CAT-013"  "CAT-0155"

data

x <- c('LOCAT01PE', 'WECAT013EJD', 'AFECAT0155DR', 
    'LO-CAT-01PE', 'WE-CAT-013-EJD', 'AFE-CAT-0155-DR')
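
Note that regmatches drops elements that have no match, so the result can be shorter than the input. If the output needs to stay aligned with the input, one option is to fill a vector of NA values, as in this sketch; the vector y, its 'NOMATCH' element and the name out are hypothetical.

```r
# 'NOMATCH' is a hypothetical value with no 'CAT' + digits in it
y <- c('LOCAT01PE', 'LO-CAT-01PE', 'NOMATCH')
m <- regexpr('CAT-?\\d+', y)

# regexpr() returns -1 where nothing matched; fill only the matched slots
out <- rep(NA_character_, length(y))
out[m != -1] <- regmatches(y, m)
out
```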
