I have a table with information about mutations. One column holds the amino acid change in three-letter code, as follows:
Amino acid change
------------------------
NP_006209.2:p.1025
NP_203524.1:p.12
NP_000537.3:p.273
NP_004324.2:p.600
NP_000537.3:p.215
In another table I have the three-letter code and the one-letter code of the amino acids, as follows:
three_letters|one_letters
-------------|-----------
Ala |A
Arg |R
Asn |N
Asp |D
...
Val |V
Asx |B
Glx |Z
Ter |*
I need a new column in my table of mutations with the amino acids in one-letter code, as follows:
new column
-----------
p.1025
p.12
p.273
p.600
p.215
You can solve this using a regular expression so long as the change code is always three letters followed by one or more digits followed by three letters.
regexp_match(change, 'p\.(\D{3})(\d+)(\D{3})')
That returns an array that can be used to join to your lookup table and then reconstruct the shortened code. The dot after p is escaped so that it matches a literal period rather than any character.
with split as (
  -- parts[1] = original amino acid, parts[2] = position, parts[3] = new amino acid
  select *,
         regexp_match(change, 'p\.(\D{3})(\d+)(\D{3})') as parts
  from changes
)
select s.*,
       concat('p.',
              coalesce(x1.one_letters, '?'),  -- '?' marks a code missing from the lookup table
              parts[2],
              coalesce(x2.one_letters, '?')
       ) as encoded_change
from split s
left join xlate x1 on x1.three_letters = s.parts[1]
left join xlate x2 on x2.three_letters = s.parts[3];
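To try this end to end and actually fill a new column, here is a minimal sketch. It assumes PostgreSQL 10 or later (for regexp_match) and uses throwaway temporary tables. The sample rows and the encoded_change column on changes are assumptions for illustration, not part of the original schema.

```sql
-- Hypothetical sample data; table and column names follow the query above.
create temporary table changes (change text, encoded_change text);
insert into changes (change) values
  ('NP_006209.2:p.Thr1025Ala'),
  ('NP_004324.2:p.Val600Glu');

create temporary table xlate (three_letters text, one_letters text);
insert into xlate values
  ('Thr', 'T'), ('Ala', 'A'), ('Val', 'V'), ('Glu', 'E');

-- Split each change with the regular expression, translate both
-- amino acids through the lookup table, and store the short code.
update changes c
set encoded_change = concat('p.',
                            coalesce(x1.one_letters, '?'),
                            s.parts[2],
                            coalesce(x2.one_letters, '?'))
from (
  select change,
         regexp_match(change, 'p\.(\D{3})(\d+)(\D{3})') as parts
  from changes
) s
left join xlate x1 on x1.three_letters = s.parts[1]
left join xlate x2 on x2.three_letters = s.parts[3]
where s.change = c.change;

-- Expected short codes for the two sample rows: p.T1025A and p.V600E
select change, encoded_change from changes;
```

Rows whose amino acid code is missing from the lookup table show up with a '?' in the result, so they are easy to find afterwards.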
An alternative to the solution proposed by @Mike Organek is to create a short function to do this conversion for you.
Data Sample:
CREATE TEMPORARY TABLE map (three_letters text, one_letters text);
INSERT INTO map
VALUES ('Val','V'),('Glu','E'),('Thr','T'),('Ala','A');
Function:
CREATE OR REPLACE FUNCTION change_amino_acid(text)
RETURNS TEXT AS $BODY$
DECLARE
  i RECORD;
  acid TEXT;
BEGIN
  -- Keep only the part after ':p.', e.g. 'Thr1025Ala'
  acid := trim((string_to_array($1, ':p.'))[2]);
  -- Loop over the map rows whose three-letter code occurs in the change
  FOR i IN SELECT * FROM map
           WHERE three_letters = ANY(regexp_split_to_array(acid, '\d+'))
  LOOP
    acid := replace(acid, i.three_letters, i.one_letters);
  END LOOP;
  RETURN 'p.' || acid;
END; $BODY$ LANGUAGE plpgsql;
How to call the function:
SELECT
change_amino_acid('NP_006209.2:p.Thr1025Ala'),
change_amino_acid('NP_004324.2:p.Val600Glu');
change_amino_acid | change_amino_acid
-------------------+-------------------
p.T1025A | p.V600E
After that, all you need to do is UPDATE your table using the function:
UPDATE my_table
SET newcolum = change_amino_acid(long_amino_acid);
Your string is in a very particular format. The prefix looks like a fixed length. Then it is followed by three characters, a number (presumably a position), and then three more characters.
If this is always the case, you don't need any really sophisticated machinery for the replacement. You can just use string operations:
with replacements as (
  select 'Thr' as three_letters, 'T' as one_letter union all
  select 'Ala' as three_letters, 'A' as one_letter
)
select v.*,
       left(mutation, 14) || r1.one_letter || replace(substr(mutation, 18), r2.three_letters, r2.one_letter)
from (values ('NP_006209.2:p.Thr1025Ala')) v(mutation)
left join replacements r1
  on r1.three_letters = substr(mutation, 15, 3)
left join replacements r2
  on r2.three_letters = right(mutation, 3);
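Since the fixed offsets (14, 15, 18) only work when every value has the same layout, it may be worth checking that assumption before relying on them. A minimal sketch follows. The second sample value is made up for illustration, with a deliberately shorter prefix, and the pattern assumes an 11-character prefix as in the example above.

```sql
-- List the values that do NOT follow the layout
-- <11-character prefix>:p.<3 letters><digits><3 letters>
select mutation
from (values
        ('NP_006209.2:p.Thr1025Ala'),  -- fits the fixed layout
        ('NP_1234.1:p.Thr10Ala')       -- hypothetical value with a shorter prefix
     ) v(mutation)
where mutation !~ '^.{11}:p\.[A-Za-z]{3}[0-9]+[A-Za-z]{3}$';
```

If this returns any rows for your real data, the substr/left offsets would cut those values in the wrong place, and splitting on the ':' or using a regular expression is the safer choice.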
I would actually recommend that you change the data structure so the values are not all encoded in a single string. Put the results in multiple columns:
name
from_amino_acid
to_amino_acid
position
Actually, I don't know what is happening before the ":", nor whether the "p." is important. You might want to split that into more than one column as well. You can use logic like this to split the string:
select split_part(mutation, ':', 1) as name,
       substring(split_part(mutation, ':', 2), 3, 3) as from_amino_acid,
       (regexp_matches(split_part(mutation, ':', 2), '[0-9]+'))[1] as position,
       right(mutation, 3) as to_amino_acid
from (values ('NP_006209.2:p.Thr1025Ala')) v(mutation);
This would simplify your SQL and probably your analyses as well.
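To illustrate how the split columns simplify things, here is a rough sketch that stores the parts in a table and rebuilds the one-letter code with two plain joins. The table name mutation_parts and the sample rows are assumptions. The position column is called aa_position here only to avoid clashing with the SQL keyword POSITION, and regexp_match requires PostgreSQL 10 or later.

```sql
-- Hypothetical normalized table built from the split logic above.
create temporary table mutation_parts as
select split_part(mutation, ':', 1) as name,
       substring(split_part(mutation, ':', 2), 3, 3) as from_amino_acid,
       (regexp_match(split_part(mutation, ':', 2), '[0-9]+'))[1]::int as aa_position,
       right(mutation, 3) as to_amino_acid
from (values ('NP_006209.2:p.Thr1025Ala'),
             ('NP_004324.2:p.Val600Glu')) v(mutation);

create temporary table replacements (three_letters text, one_letter text);
insert into replacements values
  ('Thr', 'T'), ('Ala', 'A'), ('Val', 'V'), ('Glu', 'E');

-- With separate columns, the short code is just two joins and a concat.
-- For the sample rows this yields p.V600E and p.T1025A.
select m.name,
       concat('p.', r1.one_letter, m.aa_position, r2.one_letter) as short_change
from mutation_parts m
join replacements r1 on r1.three_letters = m.from_amino_acid
join replacements r2 on r2.three_letters = m.to_amino_acid
order by m.name;
```

Keeping the position as an integer also makes it possible to filter or sort by it directly, which is awkward while it is buried inside the string.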