Regex with multiple patterns in a given character vector

I have a character vector (x, see below) whose elements come in many different formats. They are all positions on a genome but have different names. These names were given to me and belong to a list of about 6 million, so it's not easy for me to change them manually. This is a subset; there are others, like X1 or chr 13, that are part of this list too:

 x <- c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T","rs62224614.16053862.C.T","X22.17028719.G.A", "rs4535153", "X22.17028719.G.A", "kgp3171179", "rs375850426.17029070.GCAGTGGC.G"  , "chr22.17030620.G.A")

I'd like all the strings to look like this:

 y <- c("rs62224609", "rs376238049", "rs62224614", "chr22:17028719", "rs4535153", "chr22:17028719", "kgp3171179", "rs375850426", "chr22:17030620")

I've tried the following, but it removes everything after the first ".", which isn't exactly what I want:

x.test = gsub(pattern = "\\.\\S+$", replacement = "", x = x)

Any help would be greatly appreciated!

If all your data corresponds to the examples you've given:

x = c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T","rs62224614.16053862.C.T","X22.17028719.G.A", "rs4535153", "X22.17028719.G.A", "kgp3171179", "rs375850426.17029070.GCAGTGGC.G"  , "chr22.17030620.G.A")

There are two types of ids: SNP ids (starting with rs or kgp) and chromosomal positions (starting with the chromosome name). You could start off by extracting the SNP ids, with something like:

x1 = gsub("((rs|kgp)\\d+).*","\\1",x)

This returns:

[1] "rs62224609"     "rs376238049"    "rs62224614"     "X22.17028719.G.A"   "rs4535153"     "X22.17028719.G.A"   "kgp3171179"     "rs375850426"     "chr22.17030620.G.A"
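To see which names still need handling after this first step, one option is to list the elements that are not yet a bare SNP id. This is only a sketch: the variable name `leftover` is made up for illustration and is not part of the original answer.

```r
x <- c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T", "rs62224614.16053862.C.T",
       "X22.17028719.G.A", "rs4535153", "X22.17028719.G.A", "kgp3171179",
       "rs375850426.17029070.GCAGTGGC.G", "chr22.17030620.G.A")

# Step 1 from above: keep only the SNP id when the name contains rs/kgp + digits
x1 <- gsub("((rs|kgp)\\d+).*", "\\1", x)

# 'leftover' (hypothetical name): distinct names that are not a bare rs/kgp id yet
leftover <- unique(x1[!grepl("^(rs|kgp)\\d+$", x1)])
print(leftover)
```

On a list of several million names, looking at `leftover` (or a sample of it) may be a quick way to find out which other formats are present before writing the next pattern.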

Then format the chromosome positions with the following (I've assumed that your chromosomes are 1 to 22, X, Y and M, but this depends on your data):

## We look for [(chr OR X) (1 or 2 digits or X or Y or M) 1 or more punctuation marks (1 or more digits) anything] and 
## we transform it into: [chr (the second captured element) : (the third captured element)]
x2 = gsub("(chr|X)(\\d{1,2}|X|Y|M)[[:punct:]]+(\\d+).*","chr\\2:\\3",x1)

This returns:

[1] "rs62224609"     "rs376238049"    "rs62224614"     "chr22:17028719" "rs4535153"      "chr22:17028719"   "kgp3171179"     "rs375850426"    "chr22:17030620"
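The two substitutions can also be chained and checked against the desired output `y` from the question. The following is a minimal sketch rather than a definitive solution: the helper name `clean_ids` and the variables `ok` and `unmatched` are hypothetical, and the final validation pattern assumes the only acceptable results are `rs`/`kgp` ids or `chr<name>:<position>`.

```r
x <- c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T", "rs62224614.16053862.C.T",
       "X22.17028719.G.A", "rs4535153", "X22.17028719.G.A", "kgp3171179",
       "rs375850426.17029070.GCAGTGGC.G", "chr22.17030620.G.A")
y <- c("rs62224609", "rs376238049", "rs62224614", "chr22:17028719", "rs4535153",
       "chr22:17028719", "kgp3171179", "rs375850426", "chr22:17030620")

# Hypothetical helper that applies the two gsub() calls from above in sequence
clean_ids <- function(ids) {
  ids <- gsub("((rs|kgp)\\d+).*", "\\1", ids)
  gsub("(chr|X)(\\d{1,2}|X|Y|M)[[:punct:]]+(\\d+).*", "chr\\2:\\3", ids)
}

x2 <- clean_ids(x)
stopifnot(identical(x2, y))  # stops with an error if the result differs from y

# Flag anything that ended up in neither target format (assumed formats, see above)
ok <- grepl("^((rs|kgp)\\d+|chr(\\d{1,2}|X|Y|M):\\d+)$", x2)
unmatched <- x2[!ok]
print(unmatched)
```

Names such as "X1" or "chr 13", which the question mentions, are left unchanged by these patterns and would therefore show up in `unmatched`, which signals that the chromosome pattern needs extending for the full list.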
