Match string(1) within another string(2) and extract position information based on string(2)

I would like to match a string (1) within another string (2) and, based on the sequence information contained in string (1), extract position information relative to string (2). I have a dataframe containing peptide (amino acid) sequences annotated with additional chemical modifications, which occur at M or C positions. I would like to match these peptides against the file of origin, which holds all of the protein sequences that were searched using spectral matching algorithms, and output each modified amino acid together with its position in that protein.

I've used the seqinr package to read in a .fasta file which contains 20320 entries and the entries look like this:

$`sp|Q9Y478|AAKB1_HUMAN`
[1] "MGNTSSERAALERHGGHKTPRRDSSGGTKDGDRPKILMDSPEDADLFHSEEIKAPEKEEFLAWQHDLEVNDKAPAQARPTVFRWTGGGKEVYLSGSFNNWSKLPLTRSHNNFVAILDLPEGEHQYKFFVDGQWTHDPSEPIVTSQLGTVNNIIQVKKTDFEVFDALMVDSQKCSDVSELSSSPPGPYHQEPYVCKPEERFRAPPILPPHLLQVILNKDTGISCDPALLPEPNHVMLNHLYALSIKDGVMVLSATHRYKKKYVTTLLYKPI"

I have a separate dataframe containing a list of peptides, example:

           ptm_probability                    ptm_peptide            protein_ID protein_description
1 C(1.000)SDFTEEIC(1.000)R K.C[478.99]SDFTEEIC[478.99]R.R sp|P50213|IDH3A_HUMAN Isocitrate dehydrogenase [NAD] subunit alpha, mitochondrial OS=Homo sapiens GN=IDH3A PE=1 SV=1

The sequence in ptm_probability shows, in parentheses after each modified residue, the probability score that the modification is located there. The sequence in ptm_peptide includes the amino acids immediately before and after the peptide, separated from it by ".", with each modification given in square brackets, e.g. [478.99]. The number inside the brackets can differ between modifications.

Ideally I would like the output to contain a column for the list of peptides which shows the amino acid one letter code followed by the numerical position within the protein:

position
C32
C16, C20

Which packages/functions would enable me to do this? Can I match the sequence as is and tell the matcher to ignore the modification [478.99], so that it fits the format of the fasta file? Or should I strip the modifications and then work out the position from the start/end positions of the peptide within the protein? What is a fast way to do this if I have to match several hundred or several thousand peptide sequences against a list of about 20k proteins? Any suggestions would be greatly appreciated.

I am not sure about the format of your data. For my solution I assume that you have a character vector with the protein sequences in uppercase, and I use the format of your ptm_probability column. The function checks one peptide against all proteins; it should be reasonably straightforward to use lapply or purrr::map to run it over all peptides.

My solution essentially converts modified amino acids to lowercase and then looks for the positions of lowercase letters in the protein sequences. It returns a list where for each protein there is a character vector with the modified amino acid and its position.
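To make the marking step concrete before the full function, here is a minimal base R sketch that applies it to a single peptide string on its own. The variable names `marked` and `offsets` are illustrative only, and `[[:lower:]]` is used instead of `[a-z]` on the assumption that it is less sensitive to locale.

```r
# One peptide in the ptm_probability format from the question
peptide <- "C(1.000)SDFTEEIC(1.000)R"

# Lowercase the residue directly in front of each "(";
# \\L and the lookahead both need perl = TRUE
marked <- gsub("(.)(?=\\()", "\\L\\1", peptide, perl = TRUE)

# Drop everything that is not a letter (scores, dots, parentheses)
marked <- gsub("[^[:alpha:]]", "", marked)

# Offsets of the modified (lowercase) residues within the peptide
offsets <- as.integer(gregexpr("[[:lower:]]", marked)[[1]])

print(marked)
print(offsets)
```

Here `marked` becomes "cSDFTEEIcR" and `offsets` holds 1 and 9, the positions of the two modified cysteines relative to the start of the peptide.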

Data:

proteins <-c("PRQTEINCSDFTEEICRPRQTEIN",
             "SOMEPRQTEINCSDFTEEICRQTHER",
             "PRQTEINPRQTEIN")

peptide <- c("C(1.000)SDFTEEIC(1.000)R")

Function:

library(stringi)
library(purrr)

find_mods <- function(proteins, peptide){

  # first convert the amino acid carrying the modification
  # (the letter just before the opening parenthesis) to lowercase
  peptide <- gsub("(.)(?=\\()", "\\L\\1", peptide, perl = TRUE)

  # strip everything that is not a letter from the peptide string
  peptide <- gsub("[^[:alpha:]]", "", peptide)

  # do a case insensitive matching of the peptide sequence in the protein
  # and replace that occurrence with the peptide sequence. Now the modified
  # amino acids in the protein are in lowercase
  pattern <- paste0("(?i)", peptide)
  proteins <- gsub(pattern, peptide, proteins, perl = TRUE)

  # Find the lowercase letters in all proteins
  a <- gregexpr("[a-z]", proteins)
  matches_a <- regmatches(proteins, a)

  # Find the positions of all lowercase letters in all
  # proteins 
  l1 <- stringi::stri_locate_all(proteins, regex = "[a-z]")

  #combine letter and position of the modifications
  purrr::map2(matches_a,l1, ~ paste0(toupper(.x),.y[,1]) )
  }

Output:

find_mods(proteins, peptide)
[[1]]
[1] "C8"  "C16"

[[2]]
[1] "C12" "C20"

[[3]]
[1] "NA"
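
The question also asks whether one could strip the modifications and compute positions from where the peptide starts in the protein. Below is a minimal sketch of that alternative for the ptm_peptide format, offered as an illustration rather than as part of the answer above. The helper `mod_positions`, the protein IDs and the toy sequences are all hypothetical. It assumes that the proteins are held as uppercase strings named by protein_ID, that modifications sit on ordinary residues (terminal tags are not handled), and that only the first occurrence of the peptide in the protein matters.

```r
# Hypothetical helper: positions of modified residues of one ptm_peptide
# string (e.g. "K.C[478.99]SDFTEEIC[478.99]R.R") within one protein sequence
mod_positions <- function(ptm_peptide, protein_seq) {
  # Drop the flanking residues before the first "." and after the last "."
  core <- sub("^[^.]*\\.(.*)\\.[^.]*$", "\\1", ptm_peptide)
  # Lowercase each residue that is directly followed by a "[mass]" tag
  marked <- gsub("([A-Z])(?=\\[)", "\\L\\1", core, perl = TRUE)
  # Remove the "[mass]" tags themselves
  marked <- gsub("\\[[^]]*\\]", "", marked)
  # Offsets of the modified residues within the peptide
  offsets <- as.integer(gregexpr("[[:lower:]]", marked)[[1]])
  if (offsets[1] == -1L) return(character(0))   # peptide has no modification
  plain <- toupper(marked)
  # Literal search for the unmodified peptide; first occurrence only
  start <- as.integer(regexpr(plain, protein_seq, fixed = TRUE))
  if (start == -1L) return(character(0))        # peptide not in this protein
  paste0(substring(plain, offsets, offsets), start + offsets - 1L)
}

# Toy data: sequences named by (made-up) protein IDs
proteins <- c(
  "sp|TEST1|DEMO1" = "PRQTEINCSDFTEEICRPRQTEIN",
  "sp|TEST2|DEMO2" = "SOMEPRQTEINCSDFTEEICRQTHER"
)
peptides <- data.frame(
  ptm_peptide = c("N.C[478.99]SDFTEEIC[478.99]R.P", "N.C[478.99]SDFTEEICR.Q"),
  protein_ID  = c("sp|TEST1|DEMO1", "sp|TEST2|DEMO2"),
  stringsAsFactors = FALSE
)

# Look up only the protein named in protein_ID for each peptide
peptides$position <- mapply(
  function(pep, id) paste(mod_positions(pep, proteins[[id]]), collapse = ", "),
  peptides$ptm_peptide, peptides$protein_ID,
  USE.NAMES = FALSE
)
print(peptides$position)
```

Because each peptide is checked only against the protein named in its protein_ID, no peptide is scanned against all 20k sequences, which should help with speed. A list of strings such as the one seqinr::read.fasta returns with as.string = TRUE can probably be indexed by `[[id]]` in the same way, provided its names match protein_ID exactly.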
