
Finding how many times a nucleotide appears in the same position

I'm new to Python and I'm trying to solve a problem in which I am given a few DNA sequences, for example: sequences = ["GAGGTAAACTCTG", "TCCGTAAGTTTTC", "CAGGTTGGAACTC", "ACAGTCAGTTCAC", "TAGGTCATTACAG", "TAGGTACTGATGC"]

I want to know how many times the nucleotide "A" occurs at each position across all of those sequences (the answer should be 'A': [1, 4, 1, 0, 0, 3, 4, 1, 1, 3, 0, 2, 0] in that case). What I tried is:

A_pos = {"A":[sum(int(i[0]=="A") for i in sequences), sum(int(i[1]=="A") for i in sequences), sum(int(i[2]=="A") for i in sequences),

and so on for each position in the sequences. I'm trying to make it check all the positions at once instead of writing out each position manually.

The code you posted is only partial, but you are iterating over sequences once per index. You can count them in a single pass using zip (in the end every character still has to be read once, so my solution only changes the reading order):

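A = []
# zip(*sequences) yields one tuple per position, holding that position's character from every sequence
for s in zip(*sequences):
    print(s)  # show the current column
    num_a = 0
    for nuc in s:
        if nuc == "A":
            num_a += 1
    A.append(num_a)
print(A)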

The successive values of s are:

('G', 'T', 'C', 'A', 'T', 'T')
('A', 'C', 'A', 'C', 'A', 'A')
('G', 'C', 'G', 'A', 'G', 'G')

And so on. As you can see, all the sequences are read in step, one character at a time, and the result is:

[1, 4, 1, 0, 0, 3, 4, 1, 1, 3, 0, 2, 0]
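The same single pass can be written more compactly, since each column produced by zip is a tuple and tuples have a count method. A minimal sketch, reusing the sequences list from the question (the name a_counts is only for illustration):

```python
sequences = ["GAGGTAAACTCTG", "TCCGTAAGTTTTC", "CAGGTTGGAACTC",
             "ACAGTCAGTTCAC", "TAGGTCATTACAG", "TAGGTACTGATGC"]

# Each column is a tuple with the characters found at one position
# across all sequences; tuple.count tallies the "A"s in it.
a_counts = [column.count("A") for column in zip(*sequences)]
print(a_counts)
```

This should print the same list as the explicit loop; it just leaves out the intermediate print of each column.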

If the sequences are not all the same length, zip stops at the shortest one; you can use itertools.zip_longest instead, which pads the shorter sequences with a fill value of your choice.

Cheers!

You're close, but you need to loop over the index rather than write out each individual lookup:

[sum(x[i] == "A" for x in sequences) for i in range(len(sequences[0]))]

This loops over every index and, for each one, adds one for every sequence that has an "A" at that index. It assumes every sequence is at least as long as the first one.
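If the same count is needed for other nucleotides, the comprehension can be wrapped in a small helper. This is only a sketch: the function name count_at_positions is invented here, and it assumes a non-empty list of equal-length sequences.

```python
def count_at_positions(sequences, nucleotide):
    """Count how often `nucleotide` occurs at each index.

    Assumes `sequences` is non-empty and all sequences have the same length.
    """
    length = len(sequences[0])
    # A comparison yields True/False, which sum() treats as 1/0.
    return [sum(seq[i] == nucleotide for seq in sequences) for i in range(length)]


sequences = ["GAGGTAAACTCTG", "TCCGTAAGTTTTC", "CAGGTTGGAACTC",
             "ACAGTCAGTTCAC", "TAGGTCATTACAG", "TAGGTACTGATGC"]

a_pos = {"A": count_at_positions(sequences, "A")}
print(a_pos)
```

The int(...) call from the original attempt is not needed, because booleans already behave as 0 and 1 when summed.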

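# One list of per-position counts for each nucleotide
length = len(sequences[0])
result = {'A': length*[0], 'G': length*[0], 'T': length*[0], 'C': length*[0]}
for index, column in enumerate(zip(*sequences)):
    for nucleotide in column:
        result[nucleotide][index] += 1
print(result)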

Output:

{'A': [1, 4, 1, 0, 0, 3, 4, 1, 1, 3, 0, 2, 0], 'G': [1, 0, 4, 6, 0, 0, 1, 3, 1, 0, 0, 1, 2], 'T': [3, 0, 0, 0, 6, 1, 0, 2, 3, 3, 2, 3, 0], 'C': [1, 2, 1, 0, 0, 2, 1, 0, 1, 0, 4, 0, 4]}
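A variant of the same idea that avoids hard-coding the sequence length and the set of nucleotides could use collections.Counter. This is a sketch under the assumption that all sequences have equal length; the names column_counts and alphabet are illustrative only.

```python
from collections import Counter

sequences = ["GAGGTAAACTCTG", "TCCGTAAGTTTTC", "CAGGTTGGAACTC",
             "ACAGTCAGTTCAC", "TAGGTCATTACAG", "TAGGTACTGATGC"]

# One Counter per position, built from the column of characters at that position.
column_counts = [Counter(column) for column in zip(*sequences)]

# Derive the alphabet from the data; a Counter returns 0 for a missing key,
# so nucleotides absent from a column need no special handling.
alphabet = sorted(set("".join(sequences)))
result = {nuc: [counts[nuc] for counts in column_counts] for nuc in alphabet}

print(result["A"])
```

The keys come out in sorted order (A, C, G, T) rather than the order used above, but the per-nucleotide lists should be the same.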
