Extract a substring from a long string in R
I have a string that looks like this:
C|3_prime_UTR_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000383070|protein_coding|1/1||ENST00000383070.1:c.*51C>G||762|||||rs781744002|1||-1||SNV|1|HGNC|11311|YES|||CCDS14772.1|ENSP00000372547|Q05066|Q6J4J1&A7WPU8|UPI0000135F78|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|upstream_gene_variant|MODIFIER|RNASEH2CP1|ENSG00000237659|Transcript|ENST00000454281|processed_pseudogene||||||||||rs781744002|1|2889|1||SNV|1|HGNC|24117|YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|RNU6-1334P|ENSG00000251841|Transcript|ENST00000516032|snRNA||||||||||rs781744002|1|2085|1||SNV|1|HGNC|48297|YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000525526|protein_coding||||||||||rs781744002|1|70|-1||SNV|1|HGNC|11311|||||ENSP00000437575||F5H6J8|UPI0002064E1A|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000534739|protein_coding||||||||||rs781744002|1|166|-1||SNV|1|HGNC|11311|||||ENSP00000438917||F5H3H1|UPI0002064E1B|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||"
What I wish to do is extract the following values from the string (they were bold in the original): 3_prime_UTR_variant, SRY and 24117.
Usually, I use this type of code:
str_extract(data_snp$vep, "(?<=xxx=)[^|]+")
But this time it didn't work. Is there any way that R can do this? Thank you :)
We can use read.delim for this:
txt <- "C|3_prime_UTR_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000383070|protein_coding|1/1||ENST00000383070.1:c.*51C>G||762|||||rs781744002|1||-1||SNV|1|HGNC|11311|YES|||CCDS14772.1|ENSP00000372547|Q05066|Q6J4J1&A7WPU8|UPI0000135F78|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|upstream_gene_variant|MODIFIER|RNASEH2CP1|ENSG00000237659|Transcript|ENST00000454281|processed_pseudogene||||||||||rs781744002|1|2889|1||SNV|1|HGNC|24117|YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|RNU6-1334P|ENSG00000251841|Transcript|ENST00000516032|snRNA||||||||||rs781744002|1|2085|1||SNV|1|HGNC|48297|YES||||||||||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000525526|protein_coding||||||||||rs781744002|1|70|-1||SNV|1|HGNC|11311|||||ENSP00000437575||F5H6J8|UPI0002064E1A|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||,C|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000534739|protein_coding||||||||||rs781744002|1|166|-1||SNV|1|HGNC|11311|||||ENSP00000438917||F5H3H1|UPI0002064E1B|1||||||C:0||||||||C:0|C:2.855e-05|C:0|C:3.067e-05|C:0|C:0|C:5.473e-05|C:0||||||||||||"
unlist(read.delim(text = txt, sep = "|", header = FALSE)[,c(2,4,93)], use.names = FALSE)
# [1] "3_prime_UTR_variant" "SRY" "24117"
If you use unlist(.) without use.names=FALSE, you get the column names (V2, V4, V93) as element names, but they are harmless.
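Note that the string holds several comma-separated records, so reading it as a single row fuses the last field of one record with the first field of the next (a field such as ",C"). A minimal sketch, assuming you would rather have one row per record and that no field itself contains a comma: split on the commas first, then pass the pieces to read.delim. The toy string is a shortened, made-up stand-in for the real one, and colClasses = "character" is my own addition so that single-letter values such as T are not converted to logical.

```r
# Toy input with the same layout as the question: records separated by ",",
# fields separated by "|" (values shortened for illustration).
toy <- "C|3_prime_UTR_variant|MODIFIER|SRY,C|upstream_gene_variant|MODIFIER|RNASEH2CP1"

# Split into records, then let read.delim parse each record as one row.
records <- strsplit(toy, ",", fixed = TRUE)[[1]]
fields <- read.delim(text = records, sep = "|", header = FALSE,
                     colClasses = "character")

# One row per record; pick the second and fourth fields of each.
fields[, c(2, 4)]
```

With this layout the field positions are the same in every record, so one pair of column indices serves all rows.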
One possible way to solve your problem:
lapply(strsplit(data_snp$vep, "\\|+"), \(x) intersect(x, c("3_prime_UTR_variant", "SRY", "24117")))
[[1]]
[1] "3_prime_UTR_variant" "SRY" "24117"
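The lookbehind pattern from the question anchors on an `xxx=` prefix, and nothing in this string has the form key=value, which is presumably why it found no match. If a regular expression is still preferred, a field can be addressed by position instead. A rough sketch in base R; nth_field is a hypothetical helper, not part of any package, and the toy string is a shortened stand-in for the real one.

```r
# Hypothetical helper: return the n-th "|"-separated field of each string.
# The pattern skips n - 1 fields (each followed by "|"), captures the next
# field, and sub() replaces the whole string with that capture.
nth_field <- function(x, n) {
  pattern <- sprintf("^(?:[^|]*\\|){%d}([^|]*).*$", as.integer(n) - 1L)
  sub(pattern, "\\1", x, perl = TRUE)
}

toy <- "C|3_prime_UTR_variant|MODIFIER|SRY"  # shortened stand-in
nth_field(toy, 2)
nth_field(toy, 4)
```

If a string has fewer than n fields, the pattern does not match and sub() returns the input unchanged, so that case needs a separate check on real data.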
You can use strsplit():
strsplit(txt, '\\|')[[1]][c(2, 4, 93)]
# [1] "3_prime_UTR_variant" "SRY" "24117"
or stringr::word():
stringr::word(txt, c(2, 4, 93), sep = '\\|')
# [1] "3_prime_UTR_variant" "SRY" "24117"
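Both calls above operate on a single string. For a whole column such as data_snp$vep from the question, the same positional indexing can be applied to every element. A minimal sketch with a made-up two-row data frame; pick_fields is a hypothetical helper name, and fixed = TRUE is used so that "|" needs no escaping.

```r
# Made-up stand-in for the question's data frame (values shortened).
data_snp <- data.frame(
  vep = c("C|3_prime_UTR_variant|MODIFIER|SRY",
          "G|missense_variant|MODERATE|TP53"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split every string on "|" and keep the fields at idx.
# Returns a character matrix with one row per input string
# (intended for two or more positions in idx).
pick_fields <- function(x, idx) {
  parts <- strsplit(x, "|", fixed = TRUE)
  t(vapply(parts, function(p) p[idx], character(length(idx))))
}

picked <- pick_fields(data_snp$vep, c(2, 4))
picked
```

Each row of the result corresponds to one input string; positions beyond the end of a string come back as NA.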