grep to match pattern within a column and within the value of the column in R
I am trying to match various key-value patterns within an AdditionalInfo column, then output the key-value pairs as separate columns in R.
My single column has values like this, with key-value pairs separated by semicolons (;):
gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
So I would want to use grep to match "gene_id "E[*]";" and output the found pattern to a new column, then use grep to match "gene_type "[Aa-Zz]";" and output the found pattern to a new column, etc.
I can't just split the column on the semicolon because some rows have 6 key-value pairs, some have 13, the pairs are not in the same order, and the values are unique.
Can anyone help me with this?
The code I am trying to use is the following:
geneID <- og[grep("gene_id "E[*]";", og$AdditionalInfo),]
Thanks for your time!
Edit
My data looks like this:
> names(og)
[1] "Chromosome" "AnnotSource" "FeatureType" "Start" "Stop"
[6] "Score" "Strand" "GenomicPhase" "AdditionalInfo"
> head(og)
Chromosome AnnotSource FeatureType Start Stop Score Strand GenomicPhase
1 chr1 HAVANA gene 11869 14409 . + .
2 chr1 HAVANA transcript 11869 14409 . + .
3 chr1 HAVANA exon 11869 12227 . + .
4 chr1 HAVANA exon 12613 12721 . + .
5 chr1 HAVANA exon 13221 14409 . + .
6 chr1 HAVANA transcript 12010 13670 . + .
AdditionalInfo
1 gene_id ENSG00000223972.5; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; level 2; havana_gene OTTHUMG00000000961.2;
2 gene_id ENSG00000223972.5; transcript_id ENST00000456328.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type processed_transcript; transcript_status KNOWN; transcript_name DDX11L1-002; level 2; tag basic; transcript_support_level 1; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000362751.1;
3 gene_id ENSG00000223972.5; transcript_id ENST00000456328.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type processed_transcript; transcript_status KNOWN; transcript_name DDX11L1-002; exon_number 1; exon_id ENSE00002234944.1; level 2; tag basic; transcript_support_level 1; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000362751.1;
4 gene_id ENSG00000223972.5; transcript_id ENST00000456328.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type processed_transcript; transcript_status KNOWN; transcript_name DDX11L1-002; exon_number 2; exon_id ENSE00003582793.1; level 2; tag basic; transcript_support_level 1; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000362751.1;
5 gene_id ENSG00000223972.5; transcript_id ENST00000456328.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type processed_transcript; transcript_status KNOWN; transcript_name DDX11L1-002; exon_number 3; exon_id ENSE00002312635.1; level 2; tag basic; transcript_support_level 1; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000362751.1;
6 gene_id ENSG00000223972.5; transcript_id ENST00000450305.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type transcribed_unprocessed_pseudogene; transcript_status KNOWN; transcript_name DDX11L1-001; level 2; ont PGO:0000005; ont PGO:0000019; tag basic; transcript_support_level NA; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000002844.2;
> dput(head(og))
structure(list(Chromosome = c("chr1", "chr1", "chr1", "chr1",
"chr1", "chr1"), AnnotSource = c("HAVANA", "HAVANA", "HAVANA",
"HAVANA", "HAVANA", "HAVANA"), FeatureType = c("gene", "transcript",
"exon", "exon", "exon", "transcript"), Start = c(11869L, 11869L,
11869L, 12613L, 13221L, 12010L), Stop = c(14409L, 14409L, 12227L,
12721L, 14409L, 13670L), Score = c(".", ".", ".", ".", ".", "."
), Strand = c("+", "+", "+", "+", "+", "+"), GenomicPhase = c(".",
".", ".", ".", ".", "."), AdditionalInfo = c("gene_id ENSG00000223972.5; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; level 2; havana_gene OTTHUMG00000000961.2;",
"gene_id ENSG00000223972.5; transcript_id ENST00000456328.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type processed_transcript; transcript_status KNOWN; transcript_name DDX11L1-002; level 2; tag basic; transcript_support_level 1; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000362751.1;",
"gene_id ENSG00000223972.5; transcript_id ENST00000456328.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type processed_transcript; transcript_status KNOWN; transcript_name DDX11L1-002; exon_number 1; exon_id ENSE00002234944.1; level 2; tag basic; transcript_support_level 1; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000362751.1;",
"gene_id ENSG00000223972.5; transcript_id ENST00000456328.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type processed_transcript; transcript_status KNOWN; transcript_name DDX11L1-002; exon_number 2; exon_id ENSE00003582793.1; level 2; tag basic; transcript_support_level 1; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000362751.1;",
"gene_id ENSG00000223972.5; transcript_id ENST00000456328.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type processed_transcript; transcript_status KNOWN; transcript_name DDX11L1-002; exon_number 3; exon_id ENSE00002312635.1; level 2; tag basic; transcript_support_level 1; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000362751.1;",
"gene_id ENSG00000223972.5; transcript_id ENST00000450305.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type transcribed_unprocessed_pseudogene; transcript_status KNOWN; transcript_name DDX11L1-001; level 2; ont PGO:0000005; ont PGO:0000019; tag basic; transcript_support_level NA; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000002844.2;"
)), .Names = c("Chromosome", "AnnotSource", "FeatureType", "Start",
"Stop", "Score", "Strand", "GenomicPhase", "AdditionalInfo"), row.names = c(NA,
6L), class = "data.frame")
You could use a regexp with a capturing group to select what comes after gene_id, up to the next semicolon. For example, using the data you posted:
sub('.*gene_id ([^;]*).*',"\\1",og$AdditionalInfo)
sub('.*gene_type ([^;]*).*',"\\1",og$AdditionalInfo)
Output (the value is the same for all six rows, so only the first element is shown):
#[1] "ENSG00000223972.5"
#[1] "transcribed_unprocessed_pseudogene"
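The sample line at the top of the question wraps each value in double quotes, while the dput data does not. As a minimal sketch, assuming the values may arrive in either form, the quotes can be stripped after the capture. The vector x below is made up for illustration and is not part of the posted data:

```r
# Two hypothetical inputs: one with quoted values, one without
x <- c('gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; level 2;',
       'gene_id ENSG00000223972.5; gene_type transcribed_unprocessed_pseudogene; level 2;')

# Capture everything between the key and the next semicolon,
# then drop any literal double quotes from the captured value
gene_id   <- gsub('"', "", sub('.*gene_id ([^;]*).*', "\\1", x), fixed = TRUE)
gene_type <- gsub('"', "", sub('.*gene_type ([^;]*).*', "\\1", x), fixed = TRUE)

print(gene_id)
print(gene_type)
```

Both elements of gene_id and gene_type should then hold the bare value, whichever form the input took.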
Note that sub returns the input string unchanged when the pattern does not match. You can also use str_match from library(stringr) to get NAs if there are no matches:
str_match(og$AdditionalInfo,".*transcript_id ([^;]*).*")[,2]
Output:
#[1] NA "ENST00000456328.2" "ENST00000456328.2" "ENST00000456328.2"
#[5] "ENST00000456328.2" "ENST00000450305.2"
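Since the goal in the question is one new column per key, the same idea can be wrapped in a small helper and applied to several keys. This is only a sketch in base R: extract_key and og_small are hypothetical names introduced here, and the helper assumes keys contain no regex metacharacters and returns only the first occurrence of a repeated key (such as ont).

```r
# Hypothetical helper: value following `key` up to the next semicolon, NA if absent
extract_key <- function(x, key) {
  # The key must start the string or follow "; " so that it is not
  # matched inside a longer key name
  pattern <- paste0("(^|; )", key, " ([^;]*)")
  hits <- regmatches(x, regexec(pattern, x))
  # Each hit is c(full match, group 1, group 2), or character(0) on no match
  vapply(hits,
         function(h) if (length(h) >= 3) h[3] else NA_character_,
         character(1), USE.NAMES = FALSE)
}

# Small stand-in for the og data frame from the question
og_small <- data.frame(
  AdditionalInfo = c(
    "gene_id ENSG00000223972.5; gene_type transcribed_unprocessed_pseudogene; level 2;",
    "gene_id ENSG00000223972.5; transcript_id ENST00000456328.2; gene_type transcribed_unprocessed_pseudogene; level 2;"
  ),
  stringsAsFactors = FALSE
)

# Add one column per key of interest
for (k in c("gene_id", "transcript_id", "gene_type")) {
  og_small[[k]] <- extract_key(og_small$AdditionalInfo, k)
}
```

Rows that lack a key get NA in that column, so the differing number and order of pairs per row does not matter.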