Regex expression with multiple patterns in a given character vector
I have a character vector (x, see below) whose elements come in many different formats. They are all positions on a genome but have different names. These names were given to me and belong to a list of about 6 million, so it's not easy for me to change them manually. This is a subset; there are others like X1 or chr 13 that are part of this list too:
x <- c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T","rs62224614.16053862.C.T","X22.17028719.G.A", "rs4535153", "X22.17028719.G.A", "kgp3171179", "rs375850426.17029070.GCAGTGGC.G" , "chr22.17030620.G.A")
I'd like all the strings to look like this:
y <- c("rs62224609", "rs376238049", "rs62224614", "chr22:17028719", "rs4535153", "chr22:17028719", "kgp3171179", "rs375850426", "chr22:17030620")
I've tried the following, but everything after the first "." is removed, which isn't exactly what I want.
x.test = gsub(pattern = "\\.\\S+$", replacement = "", x = x)
Any help would be greatly appreciated!
If all your data corresponds to the examples you've given:
x = c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T","rs62224614.16053862.C.T","X22.17028719.G.A", "rs4535153", "X22.17028719.G.A", "kgp3171179", "rs375850426.17029070.GCAGTGGC.G" , "chr22.17030620.G.A")
There are two types of ids: the ones with SNP ids (starting with rs or kgp), and the ones giving a chromosomal position (starting with the chromosome name). You could start off by extracting your SNP ids, with something like:
x1 = gsub("((rs|kgp)\\d+).*","\\1",x)
This returns:
[1] "rs62224609" "rs376238049" "rs62224614" "X22.17028719.G.A" "rs4535153" "X22.17028719.G.A" "kgp3171179" "rs375850426" "chr22.17030620.G.A"
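Before moving on, it can help to check which entries the SNP pattern actually covers, since the full list is much larger than this subset. The snippet below is only a sketch: the name is_snp is introduced here for illustration, and the pattern is anchored with ^ on the assumption that SNP ids always sit at the start of the string.

```r
x <- c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T",
       "rs62224614.16053862.C.T", "X22.17028719.G.A", "rs4535153",
       "X22.17028719.G.A", "kgp3171179", "rs375850426.17029070.GCAGTGGC.G",
       "chr22.17030620.G.A")

# TRUE for entries that start with an rs or kgp identifier
# (is_snp is an illustrative name, not part of the original answer)
is_snp <- grepl("^(rs|kgp)\\d+", x)

# Number of entries handled by the first step
sum(is_snp)

# Entries left for the chromosome-position step
x[!is_snp]
```

On this subset, six entries are SNP ids and the remaining three are chromosomal positions.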
Then format the chromosome positions with (I've assumed that you have chromosomes 1 to 22, X, Y and M, but this depends on your data):
## We look for [(chr OR X) (1 or 2 digits or X or Y or M) 1 or more punctuation marks (1 or more digits) anything] and
## we transform it into: [chr (the second captured element) : (the third captured element)]
x2 = gsub("(chr|X)(\\d{1,2}|X|Y|M)[[:punct:]]+(\\d+).*","chr\\2:\\3",x1)
This returns:
[1] "rs62224609" "rs376238049" "rs62224614" "chr22:17028719" "rs4535153" "chr22:17028719" "kgp3171179" "rs375850426" "chr22:17030620"
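If it is useful, the two substitutions can be wrapped in a single function and checked against the desired output from the question. This is a minimal sketch: normalise_ids is a hypothetical name, sub() replaces gsub() because each string needs at most one replacement, and the ^ and $ anchors are an addition that assumes every id starts at the beginning of the string.

```r
# Hypothetical helper combining both steps; not part of the original answer
normalise_ids <- function(ids) {
  # Step 1: keep only the SNP id for entries starting with rs or kgp
  out <- sub("^((rs|kgp)\\d+).*$", "\\1", ids)
  # Step 2: rewrite chromosomal positions as chr<chromosome>:<position>
  sub("^(chr|X)(\\d{1,2}|X|Y|M)[[:punct:]]+(\\d+).*$", "chr\\2:\\3", out)
}

x <- c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T",
       "rs62224614.16053862.C.T", "X22.17028719.G.A", "rs4535153",
       "X22.17028719.G.A", "kgp3171179", "rs375850426.17029070.GCAGTGGC.G",
       "chr22.17030620.G.A")
y <- c("rs62224609", "rs376238049", "rs62224614", "chr22:17028719",
       "rs4535153", "chr22:17028719", "kgp3171179", "rs375850426",
       "chr22:17030620")

# Compare the result with the expected vector from the question
identical(normalise_ids(x), y)
```

Entries that match neither pattern pass through unchanged, so on the full list it is worth inspecting whatever is left over, for example ids written with a space such as "chr 13".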