Extracting tricky text in R using REBUS (or normal regular expression)

Question

I downloaded protein localisation annotations from UniProt, but unfortunately I can't get rebus and stringr to give me what I need. After too many failed attempts I would like to ask for some help, thanks a lot!

I am using stringr and rebus, but normal regular expressions could probably also do the trick (I prefer rebus though, as it's easier to read).

#df
startDF <- data.frame(UNIPROT = c("U123", "U223", "U334"),
                   localisation = c("SUBCELLULAR LOCATION: Cell membrane {ECO:0000250}. Membrane {ECO:0000305}; Single-pass membrane protein {ECO:0000305}. Note=Colocalizes with EHD1 and EHD2 at plasma membrane in myoblasts and myotubes. Localizes into foci at the plasma membrane (By similarity). {ECO:0000250}.", "SUBCELLULAR LOCATION: Cytoplasm, cytosol {ECO:0000269|PubMed:11554768}. Endoplasmic reticulum {ECO:0000269|PubMed:11554768}. Note=May transiently interact with the endoplasmic reticulum.", "SUBCELLULAR LOCATION: Lysosome membrane {ECO:0000305|PubMed:14592447}; Multi-pass membrane protein {ECO:0000255}."))

#packages
library(stringr)
library(rebus)

#tried to extract the first entry like this, but no success:
str_extract(startDF$localisation, pattern = "SUBCELLULAR LOCATION:" %R% WRD %R% OPEN_BRACKET %R% END)


#hoped for result
resultDF <- data.frame(UNIPROT = c("U123", "U223", "U334"),
                       primary_loc = c("Cell membrane", "Cytoplasm", "Lysosome membrane"),
                       other_loc = c("Membrane;Single-pass membrane protein" , "Endoplasmic reticulum",  "Multi-pass membrane protein"),
                       note = c(NA, "May transiently interact with the endoplasmic reticulum", NA))

In the end I would like to have the info separated into columns. Ideally the primary location comes first, then the secondary locations, and then the note (if there is any). Bonus: if you could differentiate between actual secondary localisations and the description of the transmembrane domain type, you deserve a medal!

Thanks a bunch for your help!

Answer 1

There are probably waaaay easier ways to achieve the same result, but here is my first go at this problem... Hope this will get you started...

library( data.table )

#1 split the location-strings, using "Note=" as split character
l <- data.table::tstrsplit( startDF$localisation, "Note=", fixed = FALSE )

#2 now, get the locations by splitting the location-strings
#first, strip the `SUBCELLULAR LOCATION:`
l <- lapply( l, function(x) gsub( "^SUBCELLULAR LOCATION: ", "", x ) )
#and get rid of all the stuff within { ... }
l <- lapply( l, function(x) gsub( "\\{.*?\\}", "", x ) )
#now split the locations on . and ;, and trim whitespace
locations <- lapply( strsplit( l[[1]], "[.;]", fixed = FALSE ), trimws )
#remove eventual empty locations
locations <- lapply( locations, function(x) subset(x, nchar(x) > 0) )
#paste locations together
locations <- lapply( locations, paste0, collapse = ";")

#3 and the note?
notes <- l[[2]]

#4 now we build the final data.table
#first step is easy ;-)
dt <- data.table( UNIPROT  = startDF$UNIPROT )
#get the maximum number of locations
max_loc <- length( tstrsplit( locations,";" ) )
#input the locations
dt[, paste0("location_", 1:max_loc) := tstrsplit( locations, ";" ) ]
#add the note
dt[, note := notes ]

This results in the following (sorry for the screenshot; the notes are waaay too long for me to produce a decent print):
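
As a complement to the steps above, here is a minimal base-R sketch (no extra packages) that goes one step further towards the column layout asked for in the question: primary location, other locations, topology, and note. It is only a sketch under a few assumptions: every entry starts with "SUBCELLULAR LOCATION:", evidence tags are always wrapped in { ... }, location names never contain "." or ";", and transmembrane/topology descriptions can be recognised by ending in "membrane protein" or "-anchor". The names parse_one and topology_pattern are made up for this illustration, and the pattern is a hypothetical shortlist rather than a complete UniProt vocabulary.

```r
# Example data copied from the question
startDF <- data.frame(
  UNIPROT = c("U123", "U223", "U334"),
  localisation = c(
    "SUBCELLULAR LOCATION: Cell membrane {ECO:0000250}. Membrane {ECO:0000305}; Single-pass membrane protein {ECO:0000305}. Note=Colocalizes with EHD1 and EHD2 at plasma membrane in myoblasts and myotubes. Localizes into foci at the plasma membrane (By similarity). {ECO:0000250}.",
    "SUBCELLULAR LOCATION: Cytoplasm, cytosol {ECO:0000269|PubMed:11554768}. Endoplasmic reticulum {ECO:0000269|PubMed:11554768}. Note=May transiently interact with the endoplasmic reticulum.",
    "SUBCELLULAR LOCATION: Lysosome membrane {ECO:0000305|PubMed:14592447}; Multi-pass membrane protein {ECO:0000255}."
  ),
  stringsAsFactors = FALSE
)

# ASSUMPTION: entries ending like this describe topology, not a location
topology_pattern <- "membrane protein$|-anchor$"

# Hypothetical helper: parse one localisation string into a one-row data.frame
parse_one <- function(x) {
  # strip the leading label and every { ... } evidence tag
  x <- sub("^SUBCELLULAR LOCATION:\\s*", "", x, perl = TRUE)
  x <- gsub("\\s*\\{[^}]*\\}", "", x, perl = TRUE)

  # separate the free-text note (if present) from the location part
  note <- NA_character_
  if (grepl("Note=", x, fixed = TRUE)) {
    note <- sub("^.*?Note=", "", x, perl = TRUE)
    note <- sub("[.\\s]+$", "", note, perl = TRUE)  # drop trailing dots/spaces
    x <- sub("\\s*Note=.*$", "", x, perl = TRUE)
  }

  # split the remaining text on . and ; and drop empty pieces
  parts <- trimws(strsplit(x, "[.;]", perl = TRUE)[[1]])
  parts <- parts[nzchar(parts)]

  # set topology descriptions apart from real locations
  is_topology <- grepl(topology_pattern, parts, perl = TRUE)
  locs <- parts[!is_topology]
  collapse <- function(v) {
    if (length(v) > 0) paste(v, collapse = ";") else NA_character_
  }

  data.frame(
    primary_loc = if (length(locs) > 0) locs[1] else NA_character_,
    other_loc   = collapse(locs[-1]),
    topology    = collapse(parts[is_topology]),
    note        = note,
    stringsAsFactors = FALSE
  )
}

parsed <- do.call(rbind, lapply(startDF$localisation, parse_one))
resultDF <- data.frame(UNIPROT = startDF$UNIPROT, parsed,
                       stringsAsFactors = FALSE)
```

Two differences from the hoped-for result in the question are worth knowing about: the first location of U223 stays "Cytoplasm, cytosol" because the sketch does not split on commas, and U123 gets its note filled in because its annotation does contain a Note= part. If the topology shortlist turns out to be too narrow for the real data, extending topology_pattern is the only place that should need changing.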

solution1
1 ACCPTED 2019-09-17 19:39:54