
regex expression with multiple patterns in a given character vector

I have a character vector (x, see below) whose elements come in many different formats. They are all positions on a genome but are named differently. The names were given to me as part of a list of about 6 million, so I can't easily fix them by hand. This is a subset; the full list also contains other forms such as X1 or chr 13:

 x <- c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T","rs62224614.16053862.C.T","X22.17028719.G.A", "rs4535153", "X22.17028719.G.A", "kgp3171179", "rs375850426.17029070.GCAGTGGC.G"  , "chr22.17030620.G.A")

I'd like all the strings to look like this:

 y <- c("rs62224609", "rs376238049", "rs62224614", "chr22:17028719", "rs4535153", "chr22:17028719", "kgp3171179", "rs375850426", "chr22:17030620")

I've tried the following, but everything after the first "." is removed... which isn't exactly what I want.

x.test = gsub(pattern = "\\.\\S+$", replacement = "", x = x)

Any help would be greatly appreciated!

If all your data corresponds to the examples you've given:

x = c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T","rs62224614.16053862.C.T","X22.17028719.G.A", "rs4535153", "X22.17028719.G.A", "kgp3171179", "rs375850426.17029070.GCAGTGGC.G"  , "chr22.17030620.G.A")

There are two types of ids: SNP ids (starting with rs or kgp) and chromosomal positions (starting with the chromosome name). You could start off by extracting the SNP ids, with something like:

x1 = gsub("((rs|kgp)\\d+).*","\\1",x)

This returns:

[1] "rs62224609"     "rs376238049"    "rs62224614"     "X22.17028719.G.A"   "rs4535153"     "X22.17028719.G.A"   "kgp3171179"     "rs375850426"     "chr22.17030620.G.A"
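With 6 million names it may help to list what the first pass left untouched before writing the second pattern. The following is only a sketch: the variable names is_snp and leftover are made up here, and the anchored pattern assumes a finished SNP id consists of rs or kgp followed by digits and nothing else.

```r
x <- c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T",
       "rs62224614.16053862.C.T", "X22.17028719.G.A", "rs4535153",
       "X22.17028719.G.A", "kgp3171179",
       "rs375850426.17029070.GCAGTGGC.G", "chr22.17030620.G.A")

# First pass from the answer: keep only the SNP id.
x1 <- gsub("((rs|kgp)\\d+).*", "\\1", x)

# Flag elements that are now a bare SNP id (assumed form: rs/kgp + digits).
is_snp <- grepl("^(rs|kgp)\\d+$", x1)

# Everything else still needs handling by a later step.
leftover <- x1[!is_snp]
unique(leftover)
```

On the full list, unique(leftover) would also surface forms such as X1 or chr 13, if they are present.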

Then format the chromosome positions (I've assumed chromosomes 1 to 22, X, Y and M, but this depends on your data):

## We look for [(chr OR X) (1 or 2 digits or X or Y or M) 1 or more punctuation marks (1 or more digits) anything] and 
## we transform it into: [chr (the second captured element) : (the third captured element)]
x2 = gsub("(chr|X)(\\d{1,2}|X|Y|M)[[:punct:]]+(\\d+).*","chr\\2:\\3",x1)

This returns:

[1] "rs62224609"     "rs376238049"    "rs62224614"     "chr22:17028719" "rs4535153"      "chr22:17028719"   "kgp3171179"     "rs375850426"    "chr22:17030620"
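The two substitutions can be chained in a small helper and compared with the desired vector y from the question. This is a sketch under the same assumptions as above; the function name normalise_ids is hypothetical, and the patterns are copied unchanged from the two gsub() calls.

```r
# Hypothetical helper (name not from the original answer) chaining both steps.
normalise_ids <- function(ids) {
  # Step 1: keep only the SNP id (rs... or kgp...) and drop what follows.
  ids <- gsub("((rs|kgp)\\d+).*", "\\1", ids)
  # Step 2: rewrite chromosome positions as chr<name>:<position>.
  gsub("(chr|X)(\\d{1,2}|X|Y|M)[[:punct:]]+(\\d+).*", "chr\\2:\\3", ids)
}

x <- c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T",
       "rs62224614.16053862.C.T", "X22.17028719.G.A", "rs4535153",
       "X22.17028719.G.A", "kgp3171179",
       "rs375850426.17029070.GCAGTGGC.G", "chr22.17030620.G.A")
y <- c("rs62224609", "rs376238049", "rs62224614", "chr22:17028719",
       "rs4535153", "chr22:17028719", "kgp3171179", "rs375850426",
       "chr22:17030620")

result <- normalise_ids(x)
# Compare with the target from the question.
identical(result, y)
```

Note that [[:punct:]] does not match a space, so a name like chr 13 from the full list would pass through unchanged and would need its own rule.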
