
Extracting tricky text in R using REBUS (or normal regular expression)

I downloaded protein localisation annotations from UniProt, but unfortunately I can't get rebus and stringr to extract what I need. After too many failed attempts I would like to ask for some help, thanks a lot!

I am using stringr and rebus, but plain regular expressions would probably also do the trick (I prefer rebus, though, as it's easier to read).

#df
startDF <- data.frame(UNIPROT = c("U123", "U223", "U334"),
                   localisation = c("SUBCELLULAR LOCATION: Cell membrane {ECO:0000250}. Membrane {ECO:0000305}; Single-pass membrane protein {ECO:0000305}. Note=Colocalizes with EHD1 and EHD2 at plasma membrane in myoblasts and myotubes. Localizes into foci at the plasma membrane (By similarity). {ECO:0000250}.", "SUBCELLULAR LOCATION: Cytoplasm, cytosol {ECO:0000269|PubMed:11554768}. Endoplasmic reticulum {ECO:0000269|PubMed:11554768}. Note=May transiently interact with the endoplasmic reticulum.", "SUBCELLULAR LOCATION: Lysosome membrane {ECO:0000305|PubMed:14592447}; Multi-pass membrane protein {ECO:0000255}."))

#packages
library(stringr)
library(rebus)

#tried to extract the first entry like this, but no success:
str_extract(startDF$localisation, pattern = "SUBCELLULAR LOCATION:" %R% WRD %R% OPEN_BRACKET %R% END)


#hoped for result
resultDF <- data.frame(UNIPROT = c("U123", "U223", "U334"),
                       primary_loc = c("Cell membrane", "Cytoplasm", "Lysosome membrane"),
                       other_loc = c("Membrane;Single-pass membrane protein" , "Endoplasmic reticulum",  "Multi-pass membrane protein"),
                       note = c(NA, "May transiently interact with the endoplasmic reticulum", NA))


In the end I would like to have the information separated into columns. Ideally the primary location comes first, then the secondary locations, and then the note, if there is one. Bonus: if you could differentiate between actual secondary localisations and the description of the transmembrane domain type, you deserve a medal!

Thanks a bunch for your help!

There are probably much easier ways to achieve the same result, but here is my first go at this problem. Hope this gets you started.

library( data.table )

#1 split the location-strings, using "Note=" as split character
l <- data.table::tstrsplit( startDF$localisation, "Note=", fixed = FALSE )

#2 now, get the locations by splitting the location-strings
#first, strip the `SUBCELLULAR LOCATION:`
l <- lapply( l, function(x) gsub( "^SUBCELLULAR LOCATION: ", "", x ) )
#and get rid of all the stuff within { ... }
l <- lapply( l, function(x) gsub( "\\{.*?\\}", "", x ) )
#now split the locations on . and ;, and trim whitespace
locations <- lapply( strsplit( l[[1]], "[.;]", fixed = FALSE ), trimws )
#remove any empty locations
locations <- lapply( locations, function(x) subset(x, nchar(x) > 0) )
#paste locations together
locations <- lapply( locations, paste0, collapse = ";")

#3 and the note?
notes <- l[[2]]

#4 now we build the final data.table
#first step is easy ;-)
dt <- data.table( UNIPROT  = startDF$UNIPROT )
#get the maximum number of locations
max_loc <- length( tstrsplit( locations,";" ) )
#input the locations
dt[, paste0("location_", 1:max_loc) := tstrsplit( locations, ";" ) ]
#add the note
dt[, note := notes ]

This results in the following (sorry for the screenshot; the notes are so long that I could not produce a decent print):

[image: screenshot of the resulting data.table]
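For the column layout the question asks for (primary location, other locations, note), the same idea can be sketched with stringr and plain regular expressions. This is only a rough sketch under a few assumptions: the column name `topology` and the helper `is_topology()` are made up for illustration, and treating every entry that ends in "membrane protein" as a topology description is a heuristic, not an official UniProt rule. It also keeps "Cytoplasm, cytosol" together as the primary location rather than cutting it at the comma, and it keeps the note for U123, which the hoped-for result listed as NA.

```r
library(stringr)

# Localisation strings copied from the question.
loc <- c(
  "SUBCELLULAR LOCATION: Cell membrane {ECO:0000250}. Membrane {ECO:0000305}; Single-pass membrane protein {ECO:0000305}. Note=Colocalizes with EHD1 and EHD2 at plasma membrane in myoblasts and myotubes. Localizes into foci at the plasma membrane (By similarity). {ECO:0000250}.",
  "SUBCELLULAR LOCATION: Cytoplasm, cytosol {ECO:0000269|PubMed:11554768}. Endoplasmic reticulum {ECO:0000269|PubMed:11554768}. Note=May transiently interact with the endoplasmic reticulum.",
  "SUBCELLULAR LOCATION: Lysosome membrane {ECO:0000305|PubMed:14592447}; Multi-pass membrane protein {ECO:0000255}."
)

# Drop the {ECO:...} evidence tags and the leading label.
clean <- str_remove_all(loc, "\\s*\\{[^}]*\\}")
clean <- str_remove(clean, "^SUBCELLULAR LOCATION:\\s*")

# Everything after "Note=" is the note (NA when there is none);
# trailing full stops are stripped.
note <- str_match(clean, "Note=(.*)$")[, 2]
note <- str_remove(note, "\\.+\\s*$")

# What precedes the note is the list of locations, separated by . or ;
loc_part <- str_remove(clean, "\\s*Note=.*$")
pieces <- str_split(loc_part, "\\s*[.;]\\s*")
pieces <- lapply(pieces, function(x) x[x != ""])

# Heuristic (assumption): entries ending in "membrane protein" describe
# the membrane topology rather than a location.
is_topology <- function(x) str_detect(x, "membrane protein$")

# Collapse a character vector with ";" or return NA when it is empty.
collapse_or_na <- function(x) {
  if (length(x) == 0) NA_character_ else paste(x, collapse = ";")
}

primary_loc <- vapply(pieces, function(x) x[1], character(1))
other_loc <- vapply(pieces, function(x) {
  rest <- x[-1]
  collapse_or_na(rest[!is_topology(rest)])
}, character(1))
topology <- vapply(pieces, function(x) collapse_or_na(x[is_topology(x)]),
                   character(1))

result <- data.frame(
  UNIPROT     = c("U123", "U223", "U334"),
  primary_loc = primary_loc,
  other_loc   = other_loc,
  topology    = topology,
  note        = note,
  stringsAsFactors = FALSE
)
```

Printing `result[, 1:4]` leaves out the long notes, so the remaining columns are easy to check by eye.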
