How do I extract text from a string using a regular expression in R?

Question

I have a character vector, for example:

x <- c("gene_biotype \"protein_coding\"; transcript_name \"IGHV3-66-201\"; 
transcript_source \"havana\"; transcript_biotype \"IG_V_gene\"; 
protein_id \"ENSP00000375041\"; protein_version \"2\"; tag 
\"cds_end_NF\"; tag \"mRNA_end_NF\"; tag \"basic\"; 
transcript_support_level \"NA\";",
"gene_id \"ENSG00000211973\"; gene_version \"2\"; transcript_id 
\"ENST00000390633\"; transcript_version \"2\"; exon_number \"1\"; 
gene_name \"IGHV1-69\"; gene_source \"ensembl_havana\"; gene_biotype 
\"IG_V_gene\"; transcript_name \"IGHV1-69-201\"; transcript_source 
\"ensembl_havana\"; transcript_biotype \"IG_V_gene\"; protein_id 
\"ENSP00000375042\"; protein_version \"2\"; tag \"cds_end_NF\"; tag 
\"mRNA_end_NF\"; tag \"basic\"; transcript_support_level \"NA\";",
"gene_id \"ENSG00000211973\"; gene_version \"2\"; transcript_id 
\"ENST00000390633\"; transcript_version \"2\"; exon_number \"2\"; 
gene_name \"IGHV1-69\"; gene_source \"ensembl_havana\"; gene_biotype 
\"protein_coding\";")

I need to extract the quoted text (any characters) that follows gene_biotype. For example:

[1] protein_coding
[2] IG_V_gene
[3] protein_coding

I tried using str_extract from the stringr package, but I could not get the regular expression to work.

Any help would be greatly appreciated!

Answer 1

You can use a regular expression with the stringr package to get the data you want. For example:

library(stringr)
str_match(x, "gene_biotype\\s+\"([^\"]+)\"")
#      [,1]                                [,2]            
# [1,] "gene_biotype \"protein_coding\""   "protein_coding"
# [2,] "gene_biotype \n\"IG_V_gene\""      "IG_V_gene"     
# [3,] "gene_biotype \n\"protein_coding\"" "protein_coding"

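This returns a matrix with the full match in the first column and the captured category in the second. If you only want the category, you can do: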

str_match(x, "gene_biotype\\s+\"([^\"]+)\"")[,2]
# [1] "protein_coding" "IG_V_gene"      "protein_coding"
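For readers without stringr, roughly the same extraction can be sketched in base R. This is a minimal sketch and not part of the original answer: the shortened vector `x_demo` is a hypothetical stand-in for `x` from the question, and `regexec()` with `regmatches()` is swapped in for `str_match()`.

```r
# Hypothetical, shortened stand-in for the vector `x` from the question.
x_demo <- c(
  "gene_biotype \"protein_coding\"; transcript_name \"IGHV3-66-201\";",
  "gene_id \"ENSG00000211973\"; gene_biotype \n\"IG_V_gene\"; tag \"basic\";",
  "gene_name \"IGHV1-69\"; gene_biotype \n\"protein_coding\";"
)

# Same pattern as in the answer: the literal key, whitespace (which may
# include a line break), then a capture group up to the closing quote.
pattern <- "gene_biotype\\s+\"([^\"]+)\""

# regexec() records the positions of the full match and of each capture group;
# regmatches() turns those positions into the matched substrings.
matches <- regmatches(x_demo, regexec(pattern, x_demo))

# In each result, element 1 is the full match and element 2 is the capture
# group. An element without a match gives NA.
biotypes <- vapply(
  matches,
  function(m) if (length(m) >= 2) m[2] else NA_character_,
  character(1)
)
print(biotypes)
```

The `\\s+` is what lets the pattern cope with the line breaks that sit between `gene_biotype` and its value in some elements.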

Answer 2

I found this approach; it returns the first quoted value of the first element only:

stringi::stri_extract_all_regex(x, '(?<=").*?(?=")')[[1]][1]
#[1] "protein_coding"
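The lookaround pattern above matches any quoted value, and `[[1]][1]` keeps only the first one from the first element, so it gives the biotype only when `gene_biotype` happens to be the first key. A sketch that ties the match to the key and covers every element might look like the following. It is not the answerer's code: `x_demo` is a hypothetical shortened stand-in for `x`, and `stri_match_first_regex()` is swapped in for `stri_extract_all_regex()`.

```r
library(stringi)

# Hypothetical, shortened stand-in for the vector `x` from the question.
x_demo <- c(
  "gene_biotype \"protein_coding\"; transcript_name \"IGHV3-66-201\";",
  "gene_id \"ENSG00000211973\"; gene_biotype \n\"IG_V_gene\"; tag \"basic\";",
  "gene_name \"IGHV1-69\"; gene_biotype \n\"protein_coding\";"
)

# The lookaround pattern from the answer, applied to every element: it picks
# the first quoted value regardless of which key precedes it.
first_quoted <- stri_extract_first_regex(x_demo, '(?<=").*?(?=")')

# Anchoring on the key instead: column 1 of the result matrix is the full
# match, column 2 is the capture group with the quoted value.
biotypes <- stri_match_first_regex(x_demo, 'gene_biotype\\s+"([^"]+)"')[, 2]

print(first_quoted)
print(biotypes)
```

Comparing the two printed vectors shows where the unanchored pattern picks up a gene ID or gene name instead of the biotype.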


Answer 1
5, accepted, 2019-04-15 17:25:41

Answer 2
0, 2019-04-15 17:31:32
