Subset a character vector in R based on a specific pattern

Question

I have a vector of character ids, as rownames of a data frame in R. The rownames have the following pattern:

head(foo)
[1] "ENSG00000197372 (ZNF675)"   "ENSG00000112624 (GLTSCR1L)"
[3] "ENSG00000151320 (AKAP6)"    "ENSG00000139910 (NOVA1)"   
[5] "ENSG00000137449 (CPEB2)"    "ENSG00000004779 (NDUFAB1)"

I would like to subset the above rownames (~700 entries) to keep only the gene symbols inside the parentheses (i.e. ZNF675) and drop the rest. Is this possible with a function like gsub?

Answer 1

We can use sub: match the characters from the start of the string that are not a (, then the literal (, capture the characters inside the parentheses that are not a ), and replace the whole string with the backreference (\\1) to the captured group

row.names(foo) <- sub("^[^(]+\\(([^)]+).*", "\\1", row.names(foo))
row.names(foo)
#[1] "ZNF675"   "GLTSCR1L" "AKAP6"    "NOVA1"    "CPEB2"    "NDUFAB1"
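
To make the pieces of that pattern easier to follow, here is a minimal sketch on a plain character vector. The vector name ids and its last entry (an id with no symbol in parentheses) are made up for illustration and are not part of the original data.

```r
# Hypothetical example vector; the last entry has no "(SYMBOL)" part
ids <- c("ENSG00000197372 (ZNF675)",
         "ENSG00000151320 (AKAP6)",
         "ENSG00000000000")

# ^[^(]+   everything from the start up to (not including) the first "("
# \\(      the literal "("
# ([^)]+)  capture group 1: the characters up to the next ")"
# .*       the rest of the string, including the closing ")"
symbols <- sub("^[^(]+\\(([^)]+).*", "\\1", ids)
symbols
```

Entries that do not match the pattern are returned unchanged by sub, so the last element stays as it was. Note also that row names of a data frame must be unique, so assigning the result back with row.names<- would fail if two ids shared the same gene symbol.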

Or using str_extract from stringr, with a lookbehind that extracts the characters following a ( up to the next )

library(stringr)
str_extract(row.names(foo), "(?<=\\()[^)]+")
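
If loading stringr is not an option, the same lookbehind can probably be used in base R through regexpr and regmatches with perl = TRUE. This is only a sketch of an alternative, not part of the original answer, and the vector ids is a made-up stand-in for row.names(foo).

```r
# Hypothetical stand-in for row.names(foo)
ids <- c("ENSG00000197372 (ZNF675)", "ENSG00000151320 (AKAP6)")

# Locate the first run of non-")" characters that follows a "("
# (lookbehind needs perl = TRUE in base R)
m <- regexpr("(?<=\\()[^)]+", ids, perl = TRUE)

# Pull out the matched substrings
symbols <- regmatches(ids, m)
symbols
```

One difference to keep in mind: regmatches drops elements that have no match, so the result can be shorter than the input, whereas str_extract returns NA for those elements.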

data

foo <- data.frame(col1 = rnorm(6))
row.names(foo) <- c("ENSG00000197372 (ZNF675)",
                    "ENSG00000112624 (GLTSCR1L)",
                    "ENSG00000151320 (AKAP6)",
                    "ENSG00000139910 (NOVA1)",
                    "ENSG00000137449 (CPEB2)",
                    "ENSG00000004779 (NDUFAB1)")
