Extract a specific word matching a pattern

Question

I have a data frame with a column:

nf1$Info = AC=1;AF=0.500;AN=2;BaseQRankSum=-1.026e+00;ClippingRankSum=-1.026e+00;DP=4;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=28.25;MQRankSum=-1.026e+00;QD=10.18;ReadPosRankSum=1.03;SOR=0.693

I'm trying to extract a specific value from this column.

For example, I'm interested in "MQRankSum" and I used:

str_extract(nf1$Info,"[MQRankSum]+=[:punct:]+[0-9]+[.]+[0-9]+")

It returns the value for BaseQRankSum instead of MQRankSum.

Answer 1

Putting characters inside square brackets creates a character class that matches any one of the listed characters, so [yes]+ matches yyyyyyyyy, eyyyyss, etc. In your pattern, [MQRankSum]+ therefore matches the QRankSum part of BaseQRankSum, which occurs earlier in the string than MQRankSum.

What you want to do is match the literal word MQRankSum, then =, and then any characters other than ;:
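To see the difference concretely, here is a minimal sketch. It assumes a shortened copy of the INFO string from the question and uses base R's regmatches() and regexpr() instead of stringr, only so that it runs without extra packages; the variable names are made up for illustration.

```r
# Shortened INFO string from the question (assumption: one row only)
info <- "AC=1;AF=0.500;AN=2;BaseQRankSum=-1.026e+00;ClippingRankSum=-1.026e+00;DP=4;MQ=28.25;MQRankSum=-1.026e+00;QD=10.18"

# Character class: any run of the letters M, Q, R, a, n, k, S, u, m followed by "="
class_hit <- regmatches(info, regexpr("[MQRankSum]+=", info))

# Literal word: the exact text "MQRankSum=" followed by everything up to the next ";"
literal_hit <- regmatches(info, regexpr("MQRankSum=[^;]+", info))

print(class_hit)    # the tail of BaseQRankSum, not the tag we wanted
print(literal_hit)  # the MQRankSum tag with its value
```

The first match is found inside BaseQRankSum because the regex engine returns the leftmost match, and the character class does not care about the order of the letters.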

str_extract(nf1$Info,"MQRankSum=[^;]+")

If you want to exclude MQRankSum= from the match, use a lookbehind:

str_extract(nf1$Info,"(?<=MQRankSum=)[^;]+")
                      ^^^^^^^^^^^^^^^

The (?<=MQRankSum=) positive lookbehind makes sure that the text MQRankSum= is immediately to the left of the current location, and only then does the pattern match one or more characters other than ;.
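If the value is needed as a number, the lookbehind match can be converted afterwards. The following is a rough sketch in base R (perl = TRUE is needed there for lookbehind support); the two-row info vector is invented for illustration, and rows without the tag are filled with NA by hand, which mirrors what str_extract() returns for a non-match.

```r
# Hypothetical two-row example: the second row has no MQRankSum tag
info <- c("AC=1;MQ=28.25;MQRankSum=-1.026e+00;QD=10.18",
          "AC=1;MQ=30.00;QD=12.00")

# regexpr() returns -1 for rows where the pattern does not match
m <- regexpr("(?<=MQRankSum=)[^;]+", info, perl = TRUE)

# Start with NA everywhere, then fill in the rows that matched
value <- rep(NA_character_, length(info))
value[m > 0] <- regmatches(info, m)

# Scientific notation such as -1.026e+00 is parsed by as.numeric()
mq_rank_sum <- as.numeric(value)
print(mq_rank_sum)
```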

Answer 2

We could split the INFO column into multiple columns, then extract the desired column:

# dummy data
df1 <- data.frame(x = 1:3,
                  info = c("AC=1;AF=0.500;MQRankSum=2;BaseQRankSum=-1.026e+00;ClippingRankSum=-1.026e+00;",
                           "AC=1;AF=0.500;MQRankSum=2;ClippingRankSum=-1.026e+00;DP=4;",
                           "AN=2;BaseQRankSum=-1.026e+00;"),
                  stringsAsFactors = FALSE)

# split INFO into separate columns
df1_info <- data.table::rbindlist(
  lapply(strsplit(df1$info, ";|="), function(i)
    setNames(data.frame(t(as.numeric(i[ c(FALSE, TRUE) ]))), i[ c(TRUE, FALSE) ])
    ),
  fill = TRUE)

df1_info
#    AC  AF MQRankSum BaseQRankSum ClippingRankSum DP AN
# 1:  1 0.5         2       -1.026          -1.026 NA NA
# 2:  1 0.5         2           NA          -1.026  4 NA
# 3: NA  NA        NA       -1.026              NA NA  2

# extract required column 
df1_info$BaseQRankSum
# [1] -1.026     NA -1.026

VCF INFO standard:

Various site-level annotations. The annotations contained in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equal sign, i.e. =, and pairs are separated by semicolons, i.e. ;, as in this example:
MQ=99.00;MQ0=0;QD=17.94.
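Following that description, a single INFO string can also be turned into a named vector and indexed by tag. This is only a sketch under the assumption that every value of interest has the form tag=value; parse_info is a hypothetical helper name, not part of any package, and entries without an = sign (flag-style tags) are simply dropped here.

```r
# Hypothetical helper: split one INFO string into a named character vector
parse_info <- function(info) {
  pairs <- strsplit(info, ";", fixed = TRUE)[[1]]
  # Keep only tag=value entries; entries without "=" carry no value
  pairs <- pairs[grepl("=", pairs, fixed = TRUE)]
  tags <- sub("=.*$", "", pairs)        # text before the first "="
  values <- sub("^[^=]*=", "", pairs)   # text after the first "="
  setNames(values, tags)
}

parsed <- parse_info("MQ=99.00;MQ0=0;QD=17.94")
qd <- as.numeric(parsed[["QD"]])
print(parsed)
print(qd)
```

Looking a tag up by name this way avoids partial matches such as MQ matching inside MQ0 or MQRankSum.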
