grep to match pattern within a column and within the value of the column in R

I am trying to match various key-value patterns within an AdditionalInfo column, then output the key-value pairs as separate columns in R.

My single column has values like this, with key-value pairs separated by semicolons (;):

gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";

So I want to use grep to match gene_id "E[*]"; and output the matched value to a new column, then use grep to match gene_type "[Aa-Zz]"; and output that value to another new column, and so on.

I can't simply split the column on the semicolon, because some rows have 6 key-value pairs and others have 13, the pairs are not in the same order, and the values are unique.

Can anyone help me with this?

The code I am trying to use is the following:

geneID <- og[grep("gene_id "E[*]";", og$AdditionalInfo),]

Thanks for your time!

Edit

My data looks like this:

> names(og)
[1] "Chromosome"     "AnnotSource"    "FeatureType"    "Start"          "Stop"          
[6] "Score"          "Strand"         "GenomicPhase"   "AdditionalInfo"

> head(og)
  Chromosome AnnotSource FeatureType Start  Stop Score Strand GenomicPhase
1       chr1      HAVANA        gene 11869 14409     .      +            .
2       chr1      HAVANA  transcript 11869 14409     .      +            .
3       chr1      HAVANA        exon 11869 12227     .      +            .
4       chr1      HAVANA        exon 12613 12721     .      +            .
5       chr1      HAVANA        exon 13221 14409     .      +            .
6       chr1      HAVANA  transcript 12010 13670     .      +            .

AdditionalInfo
1 gene_id ENSG00000223972.5; gene_type transcribed_unprocessed_pseudogene;     gene_status KNOWN; gene_name DDX11L1; level 2; havana_gene OTTHUMG00000000961.2;
2 gene_id ENSG00000223972.5; transcript_id ENST00000456328.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type processed_transcript; transcript_status KNOWN; transcript_name DDX11L1-002; level 2; tag basic; transcript_support_level 1; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000362751.1;
3 gene_id ENSG00000223972.5; transcript_id ENST00000456328.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type processed_transcript; transcript_status KNOWN; transcript_name DDX11L1-002; exon_number 1; exon_id ENSE00002234944.1; level 2; tag basic; transcript_support_level 1; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000362751.1;
4 gene_id ENSG00000223972.5; transcript_id ENST00000456328.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type processed_transcript; transcript_status KNOWN; transcript_name DDX11L1-002; exon_number 2; exon_id ENSE00003582793.1; level 2; tag basic; transcript_support_level 1; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000362751.1;
5 gene_id ENSG00000223972.5; transcript_id ENST00000456328.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type processed_transcript; transcript_status KNOWN; transcript_name DDX11L1-002; exon_number 3; exon_id ENSE00002312635.1; level 2; tag basic; transcript_support_level 1; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000362751.1;
6 gene_id ENSG00000223972.5; transcript_id ENST00000450305.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type transcribed_unprocessed_pseudogene; transcript_status KNOWN; transcript_name DDX11L1-001; level 2; ont PGO:0000005; ont PGO:0000019; tag basic; transcript_support_level NA; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000002844.2;

> dput(head(og))
structure(list(Chromosome = c("chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1"), AnnotSource = c("HAVANA", "HAVANA", "HAVANA", 
"HAVANA", "HAVANA", "HAVANA"), FeatureType = c("gene", "transcript", 
"exon", "exon", "exon", "transcript"), Start = c(11869L, 11869L, 
11869L, 12613L, 13221L, 12010L), Stop = c(14409L, 14409L, 12227L, 
12721L, 14409L, 13670L), Score = c(".", ".", ".", ".", ".", "."
), Strand = c("+", "+", "+", "+", "+", "+"), GenomicPhase = c(".", 
".", ".", ".", ".", "."), AdditionalInfo = c("gene_id ENSG00000223972.5; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; level 2; havana_gene OTTHUMG00000000961.2;", 
"gene_id ENSG00000223972.5; transcript_id ENST00000456328.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type processed_transcript; transcript_status KNOWN; transcript_name DDX11L1-002; level 2; tag basic; transcript_support_level 1; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000362751.1;", 
"gene_id ENSG00000223972.5; transcript_id ENST00000456328.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type processed_transcript; transcript_status KNOWN; transcript_name DDX11L1-002; exon_number 1; exon_id ENSE00002234944.1; level 2; tag basic; transcript_support_level 1; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000362751.1;", 
"gene_id ENSG00000223972.5; transcript_id ENST00000456328.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type processed_transcript; transcript_status KNOWN; transcript_name DDX11L1-002; exon_number 2; exon_id ENSE00003582793.1; level 2; tag basic; transcript_support_level 1; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000362751.1;", 
"gene_id ENSG00000223972.5; transcript_id ENST00000456328.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type processed_transcript; transcript_status KNOWN; transcript_name DDX11L1-002; exon_number 3; exon_id ENSE00002312635.1; level 2; tag basic; transcript_support_level 1; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000362751.1;", 
"gene_id ENSG00000223972.5; transcript_id ENST00000450305.2; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; transcript_type transcribed_unprocessed_pseudogene; transcript_status KNOWN; transcript_name DDX11L1-001; level 2; ont PGO:0000005; ont PGO:0000019; tag basic; transcript_support_level NA; havana_gene OTTHUMG00000000961.2; havana_transcript OTTHUMT00000002844.2;"
)), .Names = c("Chromosome", "AnnotSource", "FeatureType", "Start", 
"Stop", "Score", "Strand", "GenomicPhase", "AdditionalInfo"), row.names = c(NA, 
6L), class = "data.frame")

You could use a regular expression with a capturing group to select the value that follows gene_id, up to the next semicolon. For example, using the data you posted:

sub('.*gene_id ([^;]*).*',"\\1",og$AdditionalInfo)
sub('.*gene_type ([^;]*).*',"\\1",og$AdditionalInfo)

Output (the value is the same for every row of the posted data; one element shown):

#[1] "ENSG00000223972.5"
#[1] "transcribed_unprocessed_pseudogene"
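
To get the new columns the question asks for, the same idea can be wrapped in a small helper and applied once per key. The following is only a sketch in base R under stated assumptions: extract_key is a hypothetical helper name, the data frame is a cut-down stand-in for og, and keys are assumed to contain only letters, digits and underscores. Two things are added to the pattern above: the key must start a pair (beginning of the string or right after a semicolon), because a key such as level can otherwise also match the tail of transcript_support_level, and rows without the key get NA instead of the unchanged input string.

```r
# Toy stand-in for the question's `og`; only AdditionalInfo matters here.
og <- data.frame(
  AdditionalInfo = c(
    "gene_id ENSG00000223972.5; gene_type transcribed_unprocessed_pseudogene; level 2;",
    "gene_id ENSG00000223972.5; transcript_id ENST00000456328.2; level 2; transcript_support_level 1;"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: value stored under `key` in each string, NA if absent.
extract_key <- function(x, key) {
  # Group 1 anchors the key at the start of a pair; group 2 is the value.
  pattern <- paste0("(^|.*; *)", key, " ([^;]*).*")
  hit <- grepl(pattern, x, perl = TRUE)
  ifelse(hit, sub(pattern, "\\2", x, perl = TRUE), NA_character_)
}

# One new column per key of interest.
for (key in c("gene_id", "transcript_id", "level")) {
  og[[key]] <- extract_key(og$AdditionalInfo, key)
}
```

If the values are wrapped in double quotes, as in the first example of the question, they stay in the result; gsub('"', "", value) would strip them.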

You can also use str_match from library(stringr), which returns NA where there is no match (sub would return the input string unchanged):

library(stringr)
str_match(og$AdditionalInfo,".*transcript_id ([^;]*).*")[,2]

Output:

#[1] NA                  "ENST00000456328.2" "ENST00000456328.2" "ENST00000456328.2"
#[5] "ENST00000456328.2" "ENST00000450305.2"
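
If every key should become its own column without listing the keys in advance, each string can be split into its pairs and the pairs reassembled by key. This is a rough base R sketch, not a tested solution for a full annotation file: parse_info is a hypothetical helper name, every pair is assumed to have the form "key value", and a key that occurs more than once in a row (such as ont in row 6 of the posted data) is collapsed into one comma-separated value, which is a design choice rather than anything the question specifies.

```r
# Hypothetical helper: one column per key, NA where a row lacks the key.
parse_info <- function(x) {
  rows <- lapply(strsplit(x, ";\\s*", perl = TRUE), function(p) {
    p <- trimws(p)
    p <- p[nzchar(p)]                            # drop empty pieces
    keys <- sub("\\s.*$", "", p, perl = TRUE)    # text before the first space
    vals <- sub("^\\S+\\s+", "", p, perl = TRUE) # text after the key
    vals <- gsub('"', "", vals, fixed = TRUE)    # strip optional quotes
    uk <- unique(keys)
    # Collapse repeated keys into a single comma-separated value.
    vapply(uk, function(k) paste(vals[keys == k], collapse = ","), character(1))
  })
  all_keys <- unique(unlist(lapply(rows, names)))
  # Indexing by name yields NA for keys a row does not have.
  mat <- do.call(rbind, lapply(rows, function(r) unname(r[all_keys])))
  colnames(mat) <- all_keys
  as.data.frame(mat, stringsAsFactors = FALSE)
}

# Two shortened strings in the style of the posted data.
x <- c(
  "gene_id ENSG00000223972.5; level 2;",
  "gene_id ENSG00000223972.5; transcript_id ENST00000450305.2; level 2; ont PGO:0000005; ont PGO:0000019;"
)
info <- parse_info(x)
```

With the real data, cbind(og, parse_info(og$AdditionalInfo)) would attach the parsed columns to the original data frame. All resulting columns are character, so numeric fields such as level would still need as.integer().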
