
How to align and compare two elements (sequences) in a list using Python

Here is my question:

I've got a file which looks like this:

103L Sequence: MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL Disorder: ----------------------------------XXXXXX-----------------------------------------------------------------------------------------------------------------------------XX

It contains a name, which in this case is 103L; the protein sequence, which follows the "Sequence:" label; and the disorder region, which follows "Disorder:". A "-" means that the position is ordered, and an "X" means that the position is disordered. For example, the last two characters "XX" under Disorder mean that the last two positions of the protein sequence, "NL", are disordered. After I use the split method, it looks like this:

['>103L', 'Sequence:', 'MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL', 'Disorder:', '----------------------------------XXXXXX-----------------------------------------------------------------------------------------------------------------------------XX']
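A minimal sketch of how such a split list could be turned into named fields, assuming each record sits on one whitespace-separated line in the layout shown above. The function name `parse_record` and the shortened ten-residue record are made up for illustration, and the length check is an added safeguard rather than something from the question.

```python
def parse_record(line):
    """Split one record line into (name, sequence, disorder).

    Assumes the layout '>NAME Sequence: SEQ Disorder: MASK'.
    """
    fields = line.split()
    if len(fields) != 5 or fields[1] != "Sequence:" or fields[3] != "Disorder:":
        raise ValueError("unexpected record layout: %r" % line)
    name = fields[0].lstrip(">")
    seq, disorder = fields[2], fields[4]
    # The two strings must line up position by position.
    if len(seq) != len(disorder):
        raise ValueError("sequence and disorder lengths differ")
    return name, seq, disorder


# Hypothetical shortened record, for illustration only.
name, seq, disorder = parse_record(">103L Sequence: MNIFEMLRID Disorder: --XX-----X")
print(name, seq, disorder)
```

Checking the lengths up front means a truncated line fails with a clear message instead of an IndexError later on.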

I want to use Python to find the disordered residues and their positions. So the final file should look somewhat like this: Name Sequence: 'real sequence' Disorder: position(Posi) residue-name(R). Take 103L as an example:

103L Sequence: MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL Disorder: Posi R 34 K 35 S 36 p 37 S 38 L 39 N 65 N 66 L

I am new to Python and really hope someone can help me. Thank you so much!

Suppose we have the result of the split call in a variable:

split_list = ['>103L', 'Sequence:', 'MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL', 'Disorder:', '----------------------------------XXXXXX-----------------------------------------------------------------------------------------------------------------------------XX']

Let's work with just the important pieces, items 2 and 4:

res_name = split_list[2]  # ( i.e. 'MNIFEML...' )
disorder = split_list[4]  # ( i.e. '-----...XXX')

You can associate the elements of the two strings like so:

sets = []
for i,c in enumerate( disorder ):
  if c == 'X':
    sets.append( (i, res_name[i]) )

The enumerate function in Python iterates over a list-like object and yields an index and an item (i, c) for each member of disorder. At the end of this loop, sets will contain tuples of what we're after: the index numbers where 'X' occurs in disorder and the corresponding residues from res_name.

sets = [(34,'K'), (35,'S') ... ]
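As a variation on the loop above, the two strings can also be walked in parallel with `zip`, which avoids indexing into `res_name` by hand. This is only a sketch on a shortened, made-up pair of strings (the first ten residues with an invented disorder mask), not the full 103L record.

```python
# Hypothetical shortened inputs, for illustration only.
res_name = "MNIFEMLRID"
disorder = "--XX-----X"

sets = []
# zip pairs the i-th residue with the i-th disorder flag;
# enumerate adds the 0-based position.
for i, (residue, flag) in enumerate(zip(res_name, disorder)):
    if flag == "X":
        sets.append((i, residue))

print(sets)  # [(2, 'I'), (3, 'F'), (9, 'D')]
```

Note that `zip` stops at the shorter of the two strings, so a length mismatch is silently ignored rather than raising an IndexError.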

If you want to use another nice Python feature, you can construct sets in one line using what's called a list comprehension:

sets = [ (i,res_name[i]) for i,c in enumerate(disorder) if c=='X' ]

This is a quick way to build a list and is usually somewhat faster than the explicit loop, although the difference shouldn't matter for on the order of 100 items, as in your example. The only thing left is to write this new data to a file. We can create a string in the format that you want by making another list and joining the pieces with a space between them. For each tuple in sets we want the string version of the index and the residue name (which is already a string). This can be done like so:

txt = ' '.join( [str(t[0]) + ' ' + t[1] for t in sets] )

The variable txt will now be equal to:

>>> txt 
'34 K 35 S 36 P 37 S 38 L 39 N 165 N 166 L'

To write out to a file in the format you specified, you can do the following:

f = open( 'test.out', 'w' )
f.write( ' '.join(split_list[0:2]) + '\n' )
f.write( split_list[2] + ' Disorder: Posi R ' + txt )  
f.close()

The first write call puts '>103L Sequence:' on the first line and adds a newline character. The second outputs the original residue sequence followed by the txt variable we created above.
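The same two writes can also be done inside a `with` block, which closes the file automatically. The sketch below uses shortened, made-up stand-ins for `split_list` and `txt`, writes to a temporary directory (an assumption, so it does not touch your working directory), and adds a trailing newline that the snippet above does not write.

```python
import os
import tempfile

# Hypothetical shortened values standing in for split_list and txt above.
split_list = [">103L", "Sequence:", "MNIFEMLRID", "Disorder:", "--XX-----X"]
txt = "2 I 3 F 9 D"

out_path = os.path.join(tempfile.mkdtemp(), "test.out")

# The with block closes the file automatically, even if a write fails.
with open(out_path, "w") as f:
    f.write(" ".join(split_list[0:2]) + "\n")
    f.write(split_list[2] + " Disorder: Posi R " + txt + "\n")

# Read the file back to check what was written.
with open(out_path) as f:
    written = f.read()
print(written)
```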

You can break this up into three distinct parts:

  1. parse the input;
  2. construct the new disorder string;
  3. output the new file.

(1) and (3) are pretty simple, so I'll focus on (2). The main thing you need to do is iterate through your "disorder string" in a way that gives you both the character at each position and the position itself. One way to do this is to use enumerate:

for i, x in enumerate(S):
    ...  # use position i and character x here

which yields each position (stored in i) and character (stored in x) of the string S in turn. Once you have that, all you need to do is record the position and the corresponding character of seq whenever the disorder string has an "X". In Python, this could look like:

if (x == 'X'):
    new_disorder.append( "{} {}".format(i, seq[i]) )

where we format the result as a string, e.g. "34 K".

Here's a complete example:

# Parse the file which was already split into split_list
split_list = ['>103L', 'Sequence:', 'MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL', 'Disorder:', '----------------------------------XXXXXX-----------------------------------------------------------------------------------------------------------------------------XX']
header   = split_list[0] + " " + split_list[1]
seq      = split_list[2]
disorder = split_list[4]

# Create the new disorder string
new_disorder = ["Disorder: Posi R"]
for i, x in enumerate(disorder):
    if x == "X":
        # Appends a string of the form "Position AminoAcid"
        new_disorder.append( "{} {}".format(i, seq[i]) )

new_disorder = " ".join(new_disorder)

# Output the modified file
with open("seq2.txt", "w") as f:
    f.write("\n".join([header, seq, new_disorder]))

Note that I get slightly different output than the example you gave:

>103L Sequence:
MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
Disorder: Posi R 34 K 35 S 36 P 37 S 38 L 39 N 165 N 166 L
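If the input file holds several such records, the same logic could be wrapped in a function and applied to each one. This is only a sketch under the assumption that every record sits on a single line; the function name `reformat_record` and the two short records (names `AAAA` and `BBBB`, invented disorder masks) are hypothetical.

```python
def reformat_record(line):
    """Turn '>NAME Sequence: SEQ Disorder: MASK' into the three-line output."""
    name, seq_label, seq, _, disorder = line.split()
    # One "position residue" string per disordered position (0-based).
    pairs = ["{} {}".format(i, seq[i]) for i, x in enumerate(disorder) if x == "X"]
    new_disorder = " ".join(["Disorder: Posi R"] + pairs)
    return "\n".join([name + " " + seq_label, seq, new_disorder])


# Hypothetical two-record input, for illustration only.
lines = [
    ">AAAA Sequence: MNIFEMLRID Disorder: --XX-----X",
    ">BBBB Sequence: EGLRLKIYKD Disorder: X---------",
]
output = "\n".join(reformat_record(line) for line in lines)
print(output)
```

In practice `lines` would come from iterating over the open input file, and `output` would be written out as in the example above.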
