Regular expression with multiple patterns in a given character vector

Question

I have a character vector (x, see below) whose elements come in many different formats. They are all positions on a genome but are named differently. These names were given to me and belong to a list of about 6 million, so changing them manually is not practical. This is a subset; other formats such as X1 or chr 13 also occur in the full list:

 x <- c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T","rs62224614.16053862.C.T","X22.17028719.G.A", "rs4535153", "X22.17028719.G.A", "kgp3171179", "rs375850426.17029070.GCAGTGGC.G"  , "chr22.17030620.G.A")

I'd like all the strings to look like this:

 y <- c("rs62224609", "rs376238049", "rs62224614", "chr22:17028719", "rs4535153", "chr22:17028719", "kgp3171179", "rs375850426", "chr22:17030620")

I've tried the following, but it removes everything after the first ".", which isn't exactly what I want:

x.test = gsub(pattern = "\\.\\S+$", replacement = "", x = x)
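To make the problem concrete, here is a small sketch of what that pattern does on three of the entries. The match starts at the first "." and `\\S+$` then runs to the end of the string, so for the chromosome-style names the position is dropped along with the alleles. The names `x_small` and `x_test` are only for illustration.

```r
# Illustrative subset of x (the name x_small is hypothetical).
x_small <- c("rs62224609.16051249.T.C", "X22.17028719.G.A", "rs4535153")

# The match begins at the first "." and extends to the end of the string.
x_test <- gsub(pattern = "\\.\\S+$", replacement = "", x = x_small)
x_test
# [1] "rs62224609" "X22"        "rs4535153"
```

The rs entry comes out as wanted, but "X22" has lost the position that should have become "chr22:17028719".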

Any help would be greatly appreciated!

Answer 1

If all your data corresponds to the examples you've given:

x = c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T","rs62224614.16053862.C.T","X22.17028719.G.A", "rs4535153", "X22.17028719.G.A", "kgp3171179", "rs375850426.17029070.GCAGTGGC.G"  , "chr22.17030620.G.A")

There are two types of IDs: those with SNP IDs (starting with rs or kgp) and those giving a chromosomal position (starting with the chromosome name). You could start by extracting the SNP IDs with something like:

x1 = gsub("((rs|kgp)\\d+).*","\\1",x)

This returns:

[1] "rs62224609"     "rs376238049"    "rs62224614"     "X22.17028719.G.A"   "rs4535153"     "X22.17028719.G.A"   "kgp3171179"     "rs375850426"     "chr22.17030620.G.A"

Then format the chromosome positions (I've assumed chromosomes 1 to 22, X, Y and M, but this depends on your data):

## We look for [(chr OR X) (1 or 2 digits or X or Y or M) 1 or more punctuation marks (1 or more digits) anything] and 
## we transform it into: [chr (the second captured element) : (the third captured element)]
x2 = gsub("(chr|X)(\\d{1,2}|X|Y|M)[[:punct:]]+(\\d+).*","chr\\2:\\3",x1)

This returns:

[1] "rs62224609"     "rs376238049"    "rs62224614"     "chr22:17028719" "rs4535153"      "chr22:17028719"   "kgp3171179"     "rs375850426"    "chr22:17030620"
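With about 6 million names it may be worth wrapping the two steps in a function and checking which entries still do not have one of the two target shapes. The following is a minimal sketch under the same assumptions as above. The helper name `normalise_ids` and the check pattern are illustrative, not part of the original answer. It also uses `sub` with a leading `^` anchor instead of `gsub`, a small variation so that each pattern can only match from the start of the string.

```r
x <- c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T", "rs62224614.16053862.C.T",
       "X22.17028719.G.A", "rs4535153", "X22.17028719.G.A", "kgp3171179",
       "rs375850426.17029070.GCAGTGGC.G", "chr22.17030620.G.A")
y <- c("rs62224609", "rs376238049", "rs62224614", "chr22:17028719", "rs4535153",
       "chr22:17028719", "kgp3171179", "rs375850426", "chr22:17030620")

# Hypothetical helper wrapping both substitution steps.
normalise_ids <- function(ids) {
  # Step 1: keep only the SNP id for entries starting with rs or kgp.
  out <- sub("^((rs|kgp)\\d+).*", "\\1", ids)
  # Step 2: rewrite chromosome positions as chr<name>:<position>.
  sub("^(chr|X)(\\d{1,2}|X|Y|M)[[:punct:]]+(\\d+).*", "chr\\2:\\3", out)
}

res <- normalise_ids(x)
identical(res, y)
# [1] TRUE

# Flag entries that match neither target shape after both steps.
ok <- grepl("^((rs|kgp)\\d+|chr(\\d{1,2}|X|Y|M):\\d+)$", res)
res[!ok]
# character(0)
```

Entries in a format the two patterns do not cover are left unchanged, so they show up in `res[!ok]` and can be inspected before the patterns are extended.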

Solution 1
2 Accepted 2017-06-13 00:51:20
