R: How to extract the exact characters between two strings in a vector

Question

dat1 <- c('human(display_long)|uniprotkb:ESR1(gene name)')
dat2 <- c('human(display_long)|uniprotkb:TP53(gene name)')
dat3 <- c('human(display_long)|uniprotkb:GPX4(gene name)')
dat4 <- c('human(display_long)|uniprotkb:ALOX15(gene name)')
dat5 <- c('human(display_long)|uniprotkb:PGR(gene name)')
dat <- c(dat1,dat2,dat3,dat4,dat5)

How can I extract the gene name between 'human(display_long)|uniprotkb:' and '(gene name)' from each element of the vector dat? Thanks!

Answer 1

You can use regexpr and regmatches to extract the text between human(display_long)|uniprotkb: and (gene name).

regmatches(dat
 , regexpr("(?<=human\\(display_long\\)\\|uniprotkb:).*(?=\\(gene name\\))"
 , dat, perl=TRUE))
#[1] "ESR1"   "TP53"   "GPX4"   "ALOX15" "PGR"

Here (?<=human\\(display_long\\)\\|uniprotkb:) is a positive lookbehind for human(display_long)|uniprotkb:, (?=\\(gene name\\)) is a positive lookahead for (gene name), and .* is the text in between. The backslashes are doubled because the pattern is written as an R string, and perl=TRUE is required because the default regex engine does not support lookarounds.
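The lookbehind/lookahead idea can be wrapped in a small helper. This is only a sketch: extract_between is a hypothetical name, not part of the original answer, and it assumes the delimiters are literal strings that do not themselves contain \E. It uses PCRE's \Q...\E quoting so the parentheses and the pipe need no manual escaping, and a lazy .*? so the match stops at the first closing delimiter.

```r
# Hypothetical helper (not from the original answer): returns the text
# between two literal delimiters, or NA where no match is found.
extract_between <- function(x, left, right) {
  # \\Q...\\E makes PCRE treat the delimiters literally
  pattern <- paste0("(?<=\\Q", left, "\\E).*?(?=\\Q", right, "\\E)")
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  # regexpr() returns -1 for elements without a match
  out[as.vector(m) != -1] <- regmatches(x, m)
  out
}

dat <- c('human(display_long)|uniprotkb:ESR1(gene name)',
         'human(display_long)|uniprotkb:TP53(gene name)')
genes <- extract_between(dat, 'human(display_long)|uniprotkb:', '(gene name)')
print(genes)
```

Because the result always has the same length as the input, it can be assigned straight into a data frame column.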

Another way is to use sub, but if an element does not match, sub returns it unchanged instead of signalling a failure.

sub(".*human\\(display_long\\)\\|uniprotkb:(.*)\\(gene name\\).*", "\\1", dat)
#[1] "ESR1"   "TP53"   "GPX4"   "ALOX15" "PGR"
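To make the no-match behaviour concrete, here is a small sketch with one made-up element ('unrelated text') that does not follow the expected format. It compares the two approaches shown above on that input.

```r
dat_mixed <- c('human(display_long)|uniprotkb:ESR1(gene name)',
               'unrelated text')  # hypothetical non-matching element

# regmatches() drops elements without a match,
# so the result can be shorter than the input
from_regmatches <- regmatches(
  dat_mixed,
  regexpr("(?<=human\\(display_long\\)\\|uniprotkb:).*(?=\\(gene name\\))",
          dat_mixed, perl = TRUE))

# sub() keeps the length, but returns non-matching elements unchanged
from_sub <- sub(".*human\\(display_long\\)\\|uniprotkb:(.*)\\(gene name\\).*",
                "\\1", dat_mixed)

print(from_regmatches)
print(from_sub)
```

Neither result flags the bad element explicitly, so it may be worth checking with grepl that every element matches before extracting.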

Other ways, which do not search for the full pattern, might be:

regmatches(dat, regexpr("(?<=:)[^(]*", dat, perl=TRUE))
sub(".*:([^(]*).*", "\\1", dat)
sub(".*:(.*)\\(.*", "\\1", dat)
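These shorter patterns rely on the colon and the opening parenthesis alone, so they can disagree once a string contains more than one colon. The input below is a made-up example, not taken from the question: regexpr finds the leftmost match (after the first colon), while the greedy .* in the sub calls runs up to the last colon.

```r
# Hypothetical entry with two colons
x <- 'taxid:9606(human)|uniprotkb:ESR1(gene name)'

# Leftmost match: text after the FIRST colon
first_colon <- regmatches(x, regexpr("(?<=:)[^(]*", x, perl = TRUE))

# Greedy .* backtracks to the LAST colon
last_colon_a <- sub(".*:([^(]*).*", "\\1", x)
last_colon_b <- sub(".*:(.*)\\(.*", "\\1", x)

print(first_colon)
print(c(last_colon_a, last_colon_b))
```

For the data in the question there is only one colon per string, so all three variants return the same gene names.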

Answer 2

You can try this regex, which extracts the word characters between 'uniprotkb:' and the opening parenthesis (.

sub('.*uniprotkb:(\\w+)\\(.*', '\\1', dat)
#[1] "ESR1"   "TP53"   "GPX4"   "ALOX15" "PGR"
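One thing to keep in mind is that \\w only matches letters, digits and the underscore. The sketch below uses a made-up entry with a hyphenated name to show what happens in that case, and a variant with a negated character class ([^(]+) that accepts anything up to the parenthesis. Whether such names occur in your data is an assumption to check.

```r
# Hypothetical entry whose name contains a hyphen
x <- 'human(display_long)|uniprotkb:HLA-DRB1(gene name)'

# \\w+ cannot cross the hyphen, so nothing matches and sub() returns x unchanged
with_word_class <- sub('.*uniprotkb:(\\w+)\\(.*', '\\1', x)

# [^(]+ accepts every character up to the opening parenthesis
with_negated_class <- sub('.*uniprotkb:([^(]+)\\(.*', '\\1', x)

print(with_word_class)
print(with_negated_class)
```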

Answer 3

Using stringr and a lookbehind, you could try this:

library(stringr)
str_extract(dat, "(?<=:)[A-Za-z0-9]+")
#[1] "ESR1"   "TP53"   "GPX4"   "ALOX15" "PGR"

This assumes that the only colon in each string is the one directly before the gene name.

Answer 4

We can use str_remove_all to delete everything up to uniprotkb: and everything from the opening parenthesis onward:

library(stringr)
str_remove_all(dat, ".*uniprotkb:|\\(.*")
#[1] "ESR1"   "TP53"   "GPX4"   "ALOX15" "PGR"

Or use trimws from base R (the whitespace argument requires R 3.6.0 or later):

trimws(dat, whitespace = ".*uniprotkb:|\\(.*")
#[1] "ESR1"   "TP53"   "GPX4"   "ALOX15" "PGR"

4 answers

Answer 1
1 2021-07-12 13:29:46

Answer 2
0 2021-07-12 13:24:15

Answer 3
0 2021-07-12 13:38:56

Answer 4
0 2021-07-12 16:24:16
