Keep text between 2nd dash and first slash in R

I have a vector of strings that look like this:

a - bc/def_g  - A/mn/us/ww
opq - rs/ts_uf - BC/wx/yza
Abc - so/dhie7u - XYZ/En/xy/jkq - QWNE

I'd like to get the text after the 2nd dash (-) but before the first slash (/) that follows it, i.e. the result should look like

A
BC
XYZ

What is the best way to do this? The vector has more than 500K rows.

Thanks

Suppose your string is defined like this:

string <- c("a - bc/def_g  - A/mn/us/ww", 
            "opq - rs/ts_uf - BC/wx/yza", 
            "Abc - so/dhie7u - XYZ/En/xy/jkq - QWNE")

Then you can use sub

> sub(".*\\-\\s+([A-Z]+)/.*", "\\1", string)
[1] "A"   "BC"  "XYZ"
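
Note that the pattern above relies on the wanted text being upper-case letters only. As a minimal sketch, assuming the field may contain other characters, the same sub approach can be anchored on the two dashes instead. The vector name `strings2` and its last element are made up for illustration.

```r
# Hypothetical input: the last element has a lower-case field with a digit.
strings2 <- c("a - bc/def_g  - A/mn/us/ww",
              "opq - rs/ts_uf - BC/wx/yza",
              "x - y/z - ab1/def")

# Skip everything up to the second dash, capture up to the next slash.
# If an element does not match, sub() returns it unchanged.
result <- sub("^[^-]*-[^-]*-\\s*([^/]+)/.*$", "\\1", strings2)
result
```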

See regex in use here

^[^-]*-[^-]*-\s*\K[^/]+
  • ^ Assert position at the start of the line
  • [^-]* Match any character except - any number of times
  • - Match this literally
  • [^-]* Match any character except - any number of times
  • - Match this literally
  • \s* Match any number of whitespace characters
  • \K Resets the starting point of the pattern. Any previously consumed characters are no longer included in the final match
  • [^/]+ Match any character except / one or more times

Alternatively, as suggested by Jan in the comments below (I believe it has since been deleted), ^(?:[^-]*-){2}\s*\K[^/]+ may be used. It's shorter and easily scalable, but adds more steps.
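
The quantified form scales to "the n-th dash" by changing the repeat count. A rough sketch of that idea in R follows; the helper name `text_after_nth_dash` and its `n` argument are hypothetical, not part of the original answer.

```r
# Hypothetical helper: text after the n-th dash, up to the next slash.
text_after_nth_dash <- function(x, n = 2L) {
  pattern <- sprintf("^(?:[^-]*-){%d}\\s*\\K[^/]+", as.integer(n))
  m <- regexpr(pattern, x, perl = TRUE)  # \K requires the PCRE engine
  regmatches(x, m)                       # non-matching elements are dropped
}

x <- c("a - bc/def_g  - A/mn/us/ww",
       "opq - rs/ts_uf - BC/wx/yza",
       "Abc - so/dhie7u - XYZ/En/xy/jkq - QWNE")

second <- text_after_nth_dash(x)         # after the 2nd dash
first  <- text_after_nth_dash(x, n = 1)  # after the 1st dash
```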

See code in use here

x <- c("a - bc/def_g  - A/mn/us/ww", "opq - rs/ts_uf - BC/wx/yza", "Abc - so/dhie7u - XYZ/En/xy/jkq - QWNE")
# \K is a PCRE feature, so perl = TRUE is required
m <- regexpr("^[^-]*-[^-]*-\\s*\\K[^/]+", x, perl = TRUE)
regmatches(x, m)

Result: [1] "A" "BC" "XYZ"
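
One caveat with a large vector: regmatches() returns only the elements that matched, so the result can be shorter than the input. A minimal sketch that keeps the output aligned with the input, assuming NA is an acceptable placeholder for rows without two dashes (the second element of `y` is invented to show this):

```r
# Hypothetical input with one row that has no dashes at all.
y <- c("a - bc/def_g  - A/mn/us/ww",
       "no dashes here",
       "opq - rs/ts_uf - BC/wx/yza")

m <- regexpr("^[^-]*-[^-]*-\\s*\\K[^/]+", y, perl = TRUE)

# regexpr() reports -1 where there is no match.
hit <- as.vector(m) > 0

out <- rep(NA_character_, length(y))  # same length as the input
out[hit] <- regmatches(y, m)          # fill in only the matched rows
out
```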
