Regex to extract string between characters in R
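I have a very messy file with multiple delimiters, and the order and number of fields differ between rows.
Columns V1 to V5 are fine, but from V9 I would like to extract the information from "Variant_seq" and "Reference_seq", and the rsxxxx number from "Dbxref".
A further complication is that the "Variant_seq" and "Reference_seq" fields can be a single character ("A", "T", "C" or "G") or several comma-separated strings (for example "TTTT,TTC,GGGGGC"). These fields can sit at the end of V9 or anywhere in the middle.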
V1 V2 V3 V4 V5 V9
9 dbSNP SNV 10007 10007 ID=1;Variant_seq=C;Dbxref=dbSNP_154:rs1449034754;evidence_values=Frequency,TOPMed;Reference_seq=T
9 dbSNP SNV 10009 10009 ID=2;Variant_seq=C,G;Dbxref=dbSNP_154:rs1587255763;evidence_values=Frequency;Reference_seq=A
9 dbSNP SNV 14824990 14824990 ID=30545117;Reference_seq=C;clinical_significance=benign;Variant_seq=GGGC,CCCCG;ancestral_allele=C;Dbxref=dbSNP_154:rs140144319;evidence_values=Frequency,1000Genomes,ESP,Phenotype_or_Disease,ExAC,TOPMed,gnomAD;global_minor_allele_frequency=1|0.004193|211
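I first thought of awk -F with multiple delimiters and '{print }', but quickly realized this is not a viable solution because the fields are not consistent between rows. dplyr::separate is not really suited here either.
I tried to extract each single field separately into a new column, but this command does not handle the case where the field is at the end of the line: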
gsub("Reference_seq[=]([^.]+)[;].*", "\\1", df$V9)
I could not find a way to grab only the field in capture group 1 and stop when there is no ";" following it. Thank you for your help.
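Since this is somewhat complicated, I find the str_extract() function from the stringr package easier to use here.
In this case I use 3 separate lines to extract the text of interest, and use the (?<=) lookbehind operator to skip the leading text.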
df<- read.table(header=TRUE, text="V1 V2 V3 V4 V5 V9
9 dbSNP SNV 10007 10007 ID=1;Variant_seq=C;Dbxref=dbSNP_154:rs1449034754;evidence_values=Frequency,TOPMed;Reference_seq=T
9 dbSNP SNV 10009 10009 ID=2;Variant_seq=C,G;Dbxref=dbSNP_154:rs1587255763;evidence_values=Frequency;Reference_seq=A
9 dbSNP SNV 14824990 14824990 ID=30545117;Reference_seq=C;clinical_significance=benign;Variant_seq=GGGC,CCCCG;ancestral_allele=C;Dbxref=dbSNP_154:rs140144319;evidence_values=Frequency,1000Genomes,ESP,Phenotype_or_Disease,ExAC,TOPMed,gnomAD;global_minor_allele_frequency=1|0.004193|211"
)
library(stringr)
# [^;]+ stops at the next ";" or at the end of the string,
# so a field in the last position is matched as well
Variant_seq <- str_extract(df[,"V9"], "(?<=Variant_seq=)[^;]+")
Reference_seq <- str_extract(df[,"V9"], "(?<=Reference_seq=)[^;]+")
Db_ref <- str_extract(df[,"V9"], "(?<=Dbxref=)[^;]+")
data.frame(Variant_seq, Reference_seq, Db_ref)
#   Variant_seq Reference_seq                 Db_ref
# 1           C             T dbSNP_154:rs1449034754
# 2         C,G             A dbSNP_154:rs1587255763
# 3  GGGC,CCCCG             C  dbSNP_154:rs140144319
This final data frame can now be cbind-ed back onto the original.
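For readers without stringr, the same lookbehind idea can be sketched in base R, followed by the cbind step mentioned above. This is only an illustration: extract_field() is a hypothetical helper name, and the two-row data frame is a shortened copy of the question's sample.

```r
# Shortened copy of the question's data (only V1 and V9 kept)
df <- data.frame(
  V1 = c(9L, 9L),
  V9 = c("ID=1;Variant_seq=C;Dbxref=dbSNP_154:rs1449034754;evidence_values=Frequency,TOPMed;Reference_seq=T",
         "ID=30545117;Reference_seq=C;clinical_significance=benign;Variant_seq=GGGC,CCCCG;Dbxref=dbSNP_154:rs140144319"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the text after "key=" up to the next ";"
# or the end of the string, and NA where the key is absent.
extract_field <- function(x, key) {
  pattern <- paste0("(?<=", key, "=)[^;]+")
  m <- regexpr(pattern, x, perl = TRUE)   # -1 marks "no match"
  hit <- as.integer(m) != -1L
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)            # regmatches() returns matches only
  out
}

extracted <- data.frame(
  Variant_seq   = extract_field(df$V9, "Variant_seq"),
  Reference_seq = extract_field(df$V9, "Reference_seq"),
  Db_ref        = extract_field(df$V9, "Dbxref"),
  stringsAsFactors = FALSE
)

# Attach the new columns to the original data frame
result <- cbind(df, extracted)
print(result[, c("Variant_seq", "Reference_seq", "Db_ref")])
```

Pre-filling the result with NA keeps the output the same length as the input, which is what makes the cbind safe when a key is missing from some rows.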
You can do
stringr::str_match(df$V9, "Reference_seq=([^;]+)")[, 2L]
stringr::str_match(df$V9, "Variant_seq=([^;]+)")[, 2L]
stringr::str_match(df$V9, "Dbxref=([^;]+)")[, 2L]
Output
> stringr::str_match(df$V9, "Reference_seq=([^;]+)")[, 2L]
[1] "T" "A" "C"
> stringr::str_match(df$V9, "Variant_seq=([^;]+)")[, 2L]
[1] "C" "C,G" "GGGC,CCCCG"
> stringr::str_match(df$V9, "Dbxref=([^;]+)")[, 2L]
[1] "dbSNP_154:rs1449034754" "dbSNP_154:rs1587255763" "dbSNP_154:rs140144319"
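The question asks for just the rsxxxx number rather than the whole Dbxref value. A possible base-R sketch of the same capture-group idea uses regexec() with regmatches(). The name extract_rs() is hypothetical, and the pattern assumes the ID always follows a colon inside the Dbxref field, as it does in the sample rows.

```r
v9 <- c(
  "ID=1;Variant_seq=C;Dbxref=dbSNP_154:rs1449034754;evidence_values=Frequency,TOPMed;Reference_seq=T",
  "ID=30545117;Reference_seq=C;Variant_seq=GGGC,CCCCG;Dbxref=dbSNP_154:rs140144319;evidence_values=Frequency",
  "ID=3;Variant_seq=A;Reference_seq=G"   # made-up row without a Dbxref field
)

# Hypothetical helper: capture group 1 holds "rs" followed by digits.
extract_rs <- function(x) {
  hits <- regmatches(x, regexec("Dbxref=[^;:]*:(rs[0-9]+)", x))
  # Each element is c(full_match, group_1), or character(0) if nothing matched
  vapply(hits,
         function(h) if (length(h) >= 2) h[2] else NA_character_,
         character(1), USE.NAMES = FALSE)
}

rs_ids <- extract_rs(v9)
print(rs_ids)
```

Rows without a Dbxref field come back as NA instead of being dropped, so the result can be assigned directly to a new column.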
One possible solution is to use strsplit on the string, paired with sub to trim the resulting substrings
cbind(df1, sapply(c("Variant","Reference_seq","Dbxref"), function(str)
sapply(strsplit(df1[,"V9"],";"), function(x) sub("dbSNP_.*:","",x[grep(str,x)]))))
V1 V2 V3 V4 V5
1 9 dbSNP SNV 10007 10007
2 9 dbSNP SNV 10009 10009
3 9 dbSNP SNV 14824990 14824990
V9
1 ID=1;Variant_seq=C;Dbxref=dbSNP_154:rs1449034754;evidence_values=Frequency,TOPMed;Reference_seq=T
2 ID=2;Variant_seq=C,G;Dbxref=dbSNP_154:rs1587255763;evidence_values=Frequency;Reference_seq=A
3 ID=30545117;Reference_seq=C;clinical_significance=benign;Variant_seq=GGGC,CCCCG;ancestral_allele=C;Dbxref=dbSNP_154:rs140144319;evidence_values=Frequency,1000Genomes,ESP,Phenotype_or_Disease,ExAC,TOPMed,gnomAD;global_minor_allele_frequency=1|0.004193|211
Variant Reference_seq Dbxref
1 Variant_seq=C Reference_seq=T Dbxref=rs1449034754
2 Variant_seq=C,G Reference_seq=A Dbxref=rs1587255763
3 Variant_seq=GGGC,CCCCG Reference_seq=C Dbxref=rs140144319
Input df1 in reproducible form:
df1 <- structure(list(V1 = c(9L, 9L, 9L), V2 = c("dbSNP", "dbSNP", "dbSNP"
), V3 = c("SNV", "SNV", "SNV"), V4 = c(10007L, 10009L, 14824990L
), V5 = c(10007L, 10009L, 14824990L), V9 = c("ID=1;Variant_seq=C;Dbxref=dbSNP_154:rs1449034754;evidence_values=Frequency,TOPMed;Reference_seq=T",
"ID=2;Variant_seq=C,G;Dbxref=dbSNP_154:rs1587255763;evidence_values=Frequency;Reference_seq=A",
"ID=30545117;Reference_seq=C;clinical_significance=benign;Variant_seq=GGGC,CCCCG;ancestral_allele=C;Dbxref=dbSNP_154:rs140144319;evidence_values=Frequency,1000Genomes,ESP,Phenotype_or_Disease,ExAC,TOPMed,gnomAD;global_minor_allele_frequency=1|0.004193|211"
)), row.names = c(NA, -3L), class = "data.frame")
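The strsplit result above still carries the "key=" prefix in every cell. If only the bare values are wanted, one possible variation is to split each record into a named vector and look the fields up by exact name. parse_attrs() is a hypothetical helper, and the sample is a shortened copy of the question's data.

```r
v9 <- c(
  "ID=1;Variant_seq=C;Dbxref=dbSNP_154:rs1449034754;evidence_values=Frequency,TOPMed;Reference_seq=T",
  "ID=30545117;Reference_seq=C;clinical_significance=benign;Variant_seq=GGGC,CCCCG;Dbxref=dbSNP_154:rs140144319"
)

# Hypothetical helper: turn "k1=v1;k2=v2" into c(k1 = "v1", k2 = "v2")
parse_attrs <- function(s) {
  parts <- strsplit(s, ";", fixed = TRUE)[[1]]
  keys <- sub("=.*$", "", parts)      # text before the first "="
  vals <- sub("^[^=]*=", "", parts)   # text after the first "="
  setNames(vals, keys)
}

wanted <- c("Variant_seq", "Reference_seq", "Dbxref")

# One row per record; a key missing from a record yields NA
fields <- t(vapply(v9,
                   function(s) unname(parse_attrs(s)[wanted]),
                   character(3), USE.NAMES = FALSE))
colnames(fields) <- wanted

# Drop the "dbSNP_<build>:" prefix so only the rs number remains
fields[, "Dbxref"] <- sub("^dbSNP_[0-9]+:", "", fields[, "Dbxref"])
print(fields)
```

Looking fields up by exact name avoids the partial matching that grep performs on a fragment such as "Variant", at the cost of a few more lines.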
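This extracts all subfields of V9 into separate columns without using regular expressions or packages. It uses paste and chartr to convert V9 to dcf format and then reads it in with read.dcf. Finally, we append the newly created columns to DF.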
m <- DF$V9 |>
paste(collapse = "\n\n") |>
chartr(old = "=;", new = ":\n") |>
textConnection() |>
read.dcf()
DF2 <- cbind(DF, m)
> str(DF2)
'data.frame': 3 obs. of 14 variables:
$ V1 : int 9 9 9
$ V2 : chr "dbSNP" "dbSNP" "dbSNP"
$ V3 : chr "SNV" "SNV" "SNV"
$ V4 : int 10007 10009 14824990
$ V5 : int 10007 10009 14824990
$ V9 : chr "ID=1;Variant_seq=C;Dbxref=dbSNP_154:rs1449034754;evidence_values=Frequency,TOPMed;Reference_seq=T" "ID=2;Variant_seq=C,G;Dbxref=dbSNP_154:rs1587255763;evidence_values=Frequency;Reference_seq=A" "ID=30545117;Reference_seq=C;clinical_significance=benign;Variant_seq=GGGC,CCCCG;ancestral_allele=C;Dbxref=dbSNP"| __truncated__
$ ID : chr "1" "2" "30545117"
$ Variant_seq : chr "C" "C,G" "GGGC,CCCCG"
$ Dbxref : chr "dbSNP_154:rs1449034754" "dbSNP_154:rs1587255763" "dbSNP_154:rs140144319"
$ evidence_values : chr "Frequency,TOPMed" "Frequency" "Frequency,1000Genomes,ESP,Phenotype_or_Disease,ExAC,TOPMed,gnomAD"
$ Reference_seq : chr "T" "A" "C"
$ clinical_significance : chr NA NA "benign"
$ ancestral_allele : chr NA NA "C"
$ global_minor_allele_frequency: chr NA NA "1|0.004193|211"
Or, written as a single expression:
cbind(
DF,
read.dcf(textConnection(chartr("=;", ":\n", paste(DF$V9, collapse = "\n\n"))))
)
Input DF in reproducible form:
DF <- structure(list(V1 = c(9L, 9L, 9L), V2 = c("dbSNP", "dbSNP", "dbSNP"
), V3 = c("SNV", "SNV", "SNV"), V4 = c(10007L, 10009L, 14824990L
), V5 = c(10007L, 10009L, 14824990L), V9 = c("ID=1;Variant_seq=C;Dbxref=dbSNP_154:rs1449034754;evidence_values=Frequency,TOPMed;Reference_seq=T",
"ID=2;Variant_seq=C,G;Dbxref=dbSNP_154:rs1587255763;evidence_values=Frequency;Reference_seq=A",
"ID=30545117;Reference_seq=C;clinical_significance=benign;Variant_seq=GGGC,CCCCG;ancestral_allele=C;Dbxref=dbSNP_154:rs140144319;evidence_values=Frequency,1000Genomes,ESP,Phenotype_or_Disease,ExAC,TOPMed,gnomAD;global_minor_allele_frequency=1|0.004193|211"
)), class = "data.frame", row.names = c(NA, -3L))
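To make the dcf conversion more concrete, the sketch below prints the intermediate text and then reads back only the three fields the question asks about, using the fields argument of read.dcf. The two-record sample is shortened from the question. Note one assumption: no value may itself contain "=" or ";", because chartr translates every occurrence.

```r
v9 <- c(
  "ID=1;Variant_seq=C;Dbxref=dbSNP_154:rs1449034754;Reference_seq=T",
  "ID=2;Reference_seq=A;Variant_seq=C,G;Dbxref=dbSNP_154:rs1587255763"
)

# "=" becomes ":" (tag/value separator), ";" becomes a newline (one field
# per line), and a blank line separates the records.
dcf_text <- chartr("=;", ":\n", paste(v9, collapse = "\n\n"))
cat(dcf_text, "\n")

con <- textConnection(dcf_text)
# Keep only the fields of interest; the result is a character matrix
m <- read.dcf(con, fields = c("Variant_seq", "Reference_seq", "Dbxref"))
close(con)

# Strip everything up to the last ":" to leave just the rs number
rs <- sub("^.*:", "", m[, "Dbxref"])
print(m)
print(rs)
```

Because read.dcf splits each line at the first colon only, a value such as dbSNP_154:rs1449034754 survives intact and can be trimmed afterwards.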