
gsub replacing string with pattern matching code and not specific string variables

I have a long list of files that I want to standardize. Different components of the string are separated by an underscore. However, a large number of files were created without the underscore between the digits (a unique id) and the single alpha character. The specific variables will be different per file but the pattern is the same. How do I add the _ in?

I have tried gsub. It matches correctly (only the strings that need the change are altered), but the text that gets inserted is the pattern-matching code itself.

x<- c("A12_SITE_1234_J_vvv.csv","A12_SITA_1234J_vvv.csv", "A12_SITE_1678_H_vvv.csv", "A12_SITE_145C_vvv.csv")

z<- gsub(".*[0-9][A-Z]", ".*[0-9]\\_[A-Z]", x)

expected results:

"A12_SITE_1234_J_vvv.csv","A12_SITA_1234_J_vvv.csv", "A12_SITE_1678_H_vvv.csv", "A12_SITE_145_C_vvv.csv"

Current results:

"A12_SITE_1234_J_vvv.csv" ".*[0-9]_[A-Z]_vvv.csv"   "A12_SITE_1678_H_vvv.csv" ".*[0-9]_[A-Z]_vvv.csv"

We can use a regex lookaround:

sub("(?<=[0-9])(?=[A-Z])", "_", x, perl = TRUE)
#[1] "A12_SITE_1234_J_vvv.csv" "A12_SITA_1234_J_vvv.csv" 
#[3] "A12_SITE_1678_H_vvv.csv" "A12_SITE_145_C_vvv.csv" 

Or use capture groups ((...)) to capture parts of the pattern, then refer to the captured groups in the replacement with backreferences (\\1, \\2):

sub("([0-9])([A-Z])", "\\1_\\2", x, perl = TRUE)
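As a quick check, the capture-group call can be run on the sample vector from the question and compared with the lookaround version. This is only a minimal sketch; the names by_groups and by_lookaround are made up for illustration:

```r
x <- c("A12_SITE_1234_J_vvv.csv", "A12_SITA_1234J_vvv.csv",
       "A12_SITE_1678_H_vvv.csv", "A12_SITE_145C_vvv.csv")

# Capture the digit and the letter, then put "_" between them
by_groups <- sub("([0-9])([A-Z])", "\\1_\\2", x)

# Zero-width match between a digit and an uppercase letter (needs perl = TRUE)
by_lookaround <- sub("(?<=[0-9])(?=[A-Z])", "_", x, perl = TRUE)

print(by_groups)
print(identical(by_groups, by_lookaround))
```

Both calls should return the expected results listed in the question. Strings that already have the underscore contain no digit directly followed by an uppercase letter, so they are left alone.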

In the OP's code, the pattern .* (any characters) followed by a digit ([0-9]) and an uppercase letter ([A-Z]) is not captured, so the matched text is lost when it is replaced. Also, if we use [0-9] in the replacement, it is taken as a literal string.
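A small illustration of that last point, using a made-up input string "a1b" (not from the question): the replacement argument is inserted as plain text, and only backreferences to captured groups are expanded.

```r
# The replacement is not a regex: "[0-9]" is inserted as-is
literal_result <- sub("[0-9]", "[0-9]", "a1b")

# A backreference, by contrast, is expanded to the captured text
backref_result <- sub("([0-9])", "<\\1>", "a1b")

print(literal_result)
print(backref_result)
```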

Use capturing groups with backreferences in the replacement (note that a replacement cannot contain regex patterns; regex is only used to search for text):

> sub("(.*[0-9])([A-Z])", "\\1_\\2", x)
[1] "A12_SITE_1234_J_vvv.csv" "A12_SITA_1234_J_vvv.csv" "A12_SITE_1678_H_vvv.csv" "A12_SITE_145_C_vvv.csv" 

See the R online demo and the regex demo.

Pattern details

  • (.*[0-9]) - Group 1 ( \1 ): any 0+ chars, as many as possible, up to and including a digit
  • ([A-Z]) - Group 2 ( \2 ): an uppercase ASCII letter.
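Because .* is greedy, this pattern targets the last digit-plus-letter pair in a string, whereas ([0-9])([A-Z]) with sub targets the first one. The sample file names contain only one such pair, so both give the same result there. The sketch below uses a hypothetical string, "A1B_2C", chosen only because it has two candidate positions:

```r
y <- "A1B_2C"  # hypothetical input with two digit+letter pairs

# Without .*, sub() changes the first pair it finds
first_pair <- sub("([0-9])([A-Z])", "\\1_\\2", y)

# With a greedy .* in Group 1, the last pair is the one that gets split
last_pair <- sub("(.*[0-9])([A-Z])", "\\1_\\2", y)

# gsub() without .* changes every pair
all_pairs <- gsub("([0-9])([A-Z])", "\\1_\\2", y)

print(c(first_pair, last_pair, all_pairs))
```

Which variant fits depends on whether a file name could ever contain more than one digit directly followed by an uppercase letter.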
