
How to align and compare two elements (sequence) in a list using python

Here is my question:

I've got a file which looks like this:

103L Sequence: MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL Disorder: ----------------------------------XXXXXX-----------------------------------------------------------------------------------------------------------------------------XX

It contains a name, which in this case is 103L; the protein sequence, which follows the "Sequence:" label; and the disorder region, which follows "Disorder:". A "-" means that the position is ordered, and an "X" means that the position is disordered. For example, the last two "XX" under Disorder mean that the last two positions of the protein sequence, "NL", are disordered. After I use the split method, it looks like this:

['>103L', 'Sequence:', 'MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL', 'Disorder:', '----------------------------------XXXXXX-----------------------------------------------------------------------------------------------------------------------------XX']

I want to use Python to find the disordered residues and their positions. The final file should look something like this: Name Sequence: 'real sequence' Disorder: position (Posi) residue-name (R). Take 103L as an example:

103L Sequence: MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL Disorder: Posi R 34 K 35 S 36 p 37 S 38 L 39 N 65 N 66 L

I am new to Python and really hope someone can help me. Thank you so much!

Suppose we have the result of the split command in a variable:

split_list = ['>103L', 'Sequence:', 'MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL', 'Disorder:', '----------------------------------XXXXXX-----------------------------------------------------------------------------------------------------------------------------XX']

Let's work with just the important pieces, items 2 and 4:

res_name = split_list[2]  # ( i.e. 'MNIFEML...' )
disorder = split_list[4]  # ( i.e. '-----...XXX')

You can associate the elements of the two strings like so:

sets = []
for i,c in enumerate( disorder ):
  if c == 'X':
    sets.append( (i, res_name[i]) )

The enumerate function in Python iterates over a list-like object (here, a string) and yields an index and an item (i, c) for each character of disorder. At the end of this loop, sets will contain tuples of what we're after: the index numbers where 'X' occurs in disorder and the corresponding residues from res_name.

sets = [(34,'K'), (35,'S') ... ]
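To see what the loop produces, here is a minimal sketch on made-up toy strings (the short `res_name` and `disorder` values below are hypothetical, not the real 103L data). It also shows the `start` argument of `enumerate`, in case 1-based residue numbers are wanted instead of the 0-based ones used in this answer.

```python
# Toy data, shortened for illustration (not the real 103L record)
res_name = 'MNIFEMLR'
disorder = '--XX---X'

# 0-based positions, same loop as above
sets = []
for i, c in enumerate(disorder):
    if c == 'X':
        sets.append((i, res_name[i]))
print(sets)  # [(2, 'I'), (3, 'F'), (7, 'R')]

# 1-based positions: enumerate counts from 1, so index the sequence with i - 1
sets_1based = []
for i, c in enumerate(disorder, start=1):
    if c == 'X':
        sets_1based.append((i, res_name[i - 1]))
print(sets_1based)  # [(3, 'I'), (4, 'F'), (8, 'R')]
```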

If you want to use another nice Python feature, you can construct sets in one line using what's called a list comprehension:

sets = [ (i,res_name[i]) for i,c in enumerate(disorder) if c=='X' ]

This is a quick way to build a list and is usually a little faster than the explicit loop, although the difference shouldn't matter for the roughly 100 items in your example. The only thing left is to write this new data to a file. We can create a string in the format that you want by making another list and joining the pieces with a space between them. For each tuple in sets we want the string version of the index and the residue name (which is already a string). This can be done like so:

txt = ' '.join( [str(t[0]) + ' ' + t[1] for t in sets] )

The variable txt will now be equal to

>>> txt 
'34 K 35 S 36 P 37 S 38 L 39 N 165 N 166 L'

To write this out to a file in the format you specified, you can do the following:

# The with statement closes the file automatically, even if a write fails
with open('test.out', 'w') as f:
    f.write(' '.join(split_list[0:2]) + '\n')
    f.write(split_list[2] + ' Disorder: Posi R ' + txt)

The first write command puts '>103L Sequence:' on the first line and adds a newline character. The second outputs the original residue sequence and the txt variable we created above.
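Another way to pair up the two strings, shown here only as an alternative sketch, is the built-in `zip`, which walks both strings in step. The helper name `disordered_residues` and the toy input are hypothetical; the length check is an added assumption that a sequence and its disorder string should always be the same length.

```python
def disordered_residues(seq, disorder):
    """Return (position, residue) pairs where disorder has an 'X' (0-based)."""
    # Assumption: both strings describe the same residues, so lengths must match
    if len(seq) != len(disorder):
        raise ValueError('sequence and disorder string differ in length')
    return [(i, res) for i, (res, flag) in enumerate(zip(seq, disorder)) if flag == 'X']

# Toy data, not the real 103L record
pairs = disordered_residues('MNIFEMLR', '--XX---X')
txt = ' '.join('{} {}'.format(i, res) for i, res in pairs)
print(txt)  # 2 I 3 F 7 R
```

Checking the lengths first matters because `zip` silently stops at the shorter input, which would otherwise hide a truncated record.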

You can break this up into three distinct parts:

  1. parse the input;
  2. construct the new disorder string;
  3. output the new file.

(1) and (3) are pretty simple, so I'll focus on (2). The main thing you need to do is iterate through your "disorder string" where you can access the character at each position, as well as the position itself. One way to do this is to use enumerate :

for i, x in enumerate(S):

which yields, on each pass through the loop, a position (stored in i) and the character at that position (stored in x) in string S. Once you have that, all you need to do is record the position and the corresponding character in seq whenever the disorder string has an "X". In Python, this could look like:

if x == 'X':
    new_disorder.append( "{} {}".format(i, seq[i]) )

where we are formatting the result as a string, e.g. "34 K".

Here's a complete example:

# Parse the file which was already split into split_list
split_list = ['>103L', 'Sequence:', 'MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL', 'Disorder:', '----------------------------------XXXXXX-----------------------------------------------------------------------------------------------------------------------------XX']
header   = split_list[0] + " " + split_list[1]
seq      = split_list[2]
disorder = split_list[4]

# Create the new disorder string
new_disorder = ["Disorder: Posi R"]
for i, x in enumerate(disorder):
    if x == "X":
        # Appends a string of the form "Position AminoAcid", e.g. "34 K"
        new_disorder.append( "{} {}".format(i, seq[i]) )

new_disorder = " ".join(new_disorder)

# Output the modified file
# Using with ensures the file is flushed and closed after writing
with open("seq2.txt", "w") as f:
    f.write("\n".join([header, seq, new_disorder]))

Note that I get slightly different output than the example you gave (positions 165 and 166 rather than 65 and 66, and an uppercase P at position 36):

>103L Sequence:
MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
Disorder: Posi R 34 K 35 S 36 P 37 S 38 L 39 N 165 N 166 L
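Step (1) was skipped above, so here is a rough sketch of how the same idea might be applied to a whole file. It assumes every record sits on one line with five whitespace-separated fields (name, `Sequence:`, sequence, `Disorder:`, disorder string), as in the question; the function name `convert_record`, the toy record, and the file name in the trailing comment are made up for illustration.

```python
def convert_record(line):
    """Turn one 'name Sequence: SEQ Disorder: MASK' line into the new format."""
    # Assumption: exactly five whitespace-separated fields per record
    name, seq_label, seq, dis_label, disorder = line.split()
    hits = ['{} {}'.format(i, seq[i]) for i, x in enumerate(disorder) if x == 'X']
    return ' '.join([name, seq_label, seq, dis_label, 'Posi R'] + hits)

# Toy input standing in for the lines of the real file
lines = ['>TOY1 Sequence: MNIFEMLR Disorder: --XX---X']
converted = [convert_record(line) for line in lines if line.strip()]
print(converted[0])
# >TOY1 Sequence: MNIFEMLR Disorder: Posi R 2 I 3 F 7 R

# With a real file, the list of lines could come from:
# with open('input.txt') as f:      # 'input.txt' is a placeholder name
#     lines = f.readlines()
```

This version keeps each record on a single line, as in the question's desired output; a line with a different number of fields raises a ValueError from the unpacking, which makes malformed records easy to spot.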
