Extract substring from a long string in R

I have a string that looks like this:

C|3_prime_UTR_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000383070|protein_coding|1/1||ENST00000383070.1:c.*51C>G||762|||||rs781744002|1||-1||SNV|1|HGNC|11311|YES|||CCDS14772.1|ENSP00000372547|Q05066|Q6J4J1&A7WPU8|UPI0000135F78|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|upstream_gene_variant|MODIFIER|RNASEH2CP1|ENSG00000237659|Transcript|ENST00000454281|processed_pseudogene||||||||||rs781744002|1|2889|1||SNV|1|HGNC|24117|YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|RNU6-1334P|ENSG00000251841|Transcript|ENST00000516032|snRNA||||||||||rs781744002|1|2085|1||SNV|1|HGNC|48297|YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000525526|protein_coding||||||||||rs781744002|1|70|-1||SNV|1|HGNC|11311|||||ENSP00000437575||F5H6J8|UPI0002064E1A|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000534739|protein_coding||||||||||rs781744002|1|166|-1||SNV|1|HGNC|11311|||||ENSP00000438917||F5H3H1|UPI0002064E1B|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||"

What I want to do is extract the parts of the string that were shown in bold (3_prime_UTR_variant, SRY and 24117, set off by spaces below):

C| 3_prime_UTR_variant |MODIFIER| SRY |ENSG00000184895|Transcript|ENST00000383070|protein_coding|1/1||ENST00000383070.1:c.*51C>G||762|||||rs781744002|1||-1||SNV|1|HGNC|11311|YES|||CCDS14772.1|ENSP00000372547|Q05066|Q6J4J1&A7WPU8|UPI0000135F78|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|upstream_gene_variant|MODIFIER|RNASEH2CP1|ENSG00000237659|Transcript|ENST00000454281|processed_pseudogene||||||||||rs781744002|1|2889|1||SNV|1|HGNC| 24117 |YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|RNU6-1334P|ENSG00000251841|Transcript|ENST00000516032|snRNA||||||||||rs781744002|1|2085|1||SNV|1|HGNC|48297|YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000525526|protein_coding||||||||||rs781744002|1|70|-1||SNV|1|HGNC|11311|||||ENSP00000437575||F5H6J8|UPI0002064E1A|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000534739|protein_coding||||||||||rs781744002|1|166|-1||SNV|1|HGNC|11311|||||ENSP00000438917||F5H3H1|UPI0002064E1B|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||"

Usually, I use this type of code:

str_extract(data_snp$vep, "(?<=xxx=)[^|]+")

But this time it didn't work. Is there any way to do this in R? Thank you :)

We can use read.delim for this:

txt <- "C|3_prime_UTR_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000383070|protein_coding|1/1||ENST00000383070.1:c.*51C>G||762|||||rs781744002|1||-1||SNV|1|HGNC|11311|YES|||CCDS14772.1|ENSP00000372547|Q05066|Q6J4J1&A7WPU8|UPI0000135F78|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|upstream_gene_variant|MODIFIER|RNASEH2CP1|ENSG00000237659|Transcript|ENST00000454281|processed_pseudogene||||||||||rs781744002|1|2889|1||SNV|1|HGNC|24117|YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|RNU6-1334P|ENSG00000251841|Transcript|ENST00000516032|snRNA||||||||||rs781744002|1|2085|1||SNV|1|HGNC|48297|YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000525526|protein_coding||||||||||rs781744002|1|70|-1||SNV|1|HGNC|11311|||||ENSP00000437575||F5H6J8|UPI0002064E1A|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000534739|protein_coding||||||||||rs781744002|1|166|-1||SNV|1|HGNC|11311|||||ENSP00000438917||F5H3H1|UPI0002064E1B|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||"
unlist(read.delim(text = txt, sep = "|", header = FALSE)[,c(2,4,93)], use.names = FALSE)
# [1] "3_prime_UTR_variant" "SRY"                 "24117"              

If you use unlist(.) without use.names=FALSE, the result carries the column names (here V2, V4 and V93), but they are harmless.
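One thing to watch with the read.delim approach is that it guesses a type for every column. The sketch below uses a hypothetical, shortened record (not taken from the question) to show what that guessing can do, and how `colClasses = "character"` keeps every field as text. The names `txt_short`, `guessed`, `as_text` and `picked` are made up for illustration.

```r
# Hypothetical, shortened record: allele "T" and an ID with leading zeros
txt_short <- "T|missense_variant|MODERATE|SRY|007"

# Default type guessing: "T" is read as logical TRUE, "007" as integer 7
guessed <- read.delim(text = txt_short, sep = "|", header = FALSE)

# Force every column to character so fields come back exactly as written
as_text <- read.delim(text = txt_short, sep = "|", header = FALSE,
                      colClasses = "character")

# Same positional selection as above, now on untouched text fields
picked <- unlist(as_text[, c(2, 4, 5)], use.names = FALSE)
```

For the string in the question the default guessing happens to be harmless, since 24117 round-trips through integer unchanged, but an allele column containing only T or F would not.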

One possible way to solve your problem:

lapply(strsplit(data_snp$vep, "\\|+"), \(x) intersect(x, c("3_prime_UTR_variant", "SRY", "24117")))

[[1]]
[1] "3_prime_UTR_variant" "SRY"                 "24117" 
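Note that intersect() matches by value, so the values have to be known in advance, and repeated hits are collapsed into one. As a rough sketch on a hypothetical, shortened string (the names `vep_short`, `targets`, `x`, `hits_unique`, `hits_all` and `positions` are made up for illustration), `%in%` keeps every occurrence and `which()` reports where each one sits:

```r
# Hypothetical, shortened stand-in for one element of data_snp$vep
vep_short <- "C|downstream_gene_variant|MODIFIER|SRY|,C|upstream_gene_variant|MODIFIER|SRY"
targets <- c("SRY", "24117")

# Same split as above: runs of pipes are treated as one separator
x <- strsplit(vep_short, "\\|+")[[1]]

hits_unique <- intersect(x, targets)  # each matching value once
hits_all    <- x[x %in% targets]      # every occurrence, in order
positions   <- which(x %in% targets)  # where the matches sit after splitting
```

Because `"\\|+"` swallows empty fields, `positions` counts only non-empty fields and will not line up with the field numbers used in the other answers.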

You can use strsplit():

strsplit(txt, '\\|')[[1]][c(2, 4, 93)]

# [1] "3_prime_UTR_variant" "SRY"                 "24117"

or stringr::word():

stringr::word(txt, c(2, 4, 93), sep = '\\|')

# [1] "3_prime_UTR_variant" "SRY"                 "24117"
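The index 93 works because the comma between annotation records is not treated as a separator, so the fields of the second record simply continue the count from the first. If the number of fields per record ever changes, that index shifts. A possible alternative, sketched below on a hypothetical, shortened string and assuming that no field itself contains a comma, is to split on commas first and then address fields by their position within each record. The names `vep`, `records`, `fields`, `consequence` and `gene_symbol` are made up for illustration.

```r
# Hypothetical, shortened stand-in for one element of data_snp$vep
vep <- "C|3_prime_UTR_variant|MODIFIER|SRY|ENSG00000184895,C|upstream_gene_variant|MODIFIER|RNASEH2CP1|ENSG00000237659"

# One element per annotation record (assumes commas only separate records)
records <- strsplit(vep, ",", fixed = TRUE)[[1]]

# Split every record into its pipe-delimited fields
fields <- strsplit(records, "|", fixed = TRUE)

# Pull the same position out of every record
consequence <- vapply(fields, `[`, character(1), 2)
gene_symbol <- vapply(fields, `[`, character(1), 4)
```

With the full string from the question, 24117 would then be field 26 of the second record by my count (the same position that holds 11311 in the first record), rather than field 93 of the whole string.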
