
Finding how many times a nucleotide appears in the same position

I'm new to Python and I'm trying to solve a problem in which I'm given a few DNA sequences, for example: sequences = ["GAGGTAAACTCTG", "TCCGTAAGTTTTC", "CAGGTTGGAACTC", "ACAGTCAGTTCAC", "TAGGTCATTACAG", "TAGGTACTGATGC"]

I want to know how many times the nucleotide "A" occurs at each position across all of those sequences (the answer should be 'A': [1, 4, 1, 0, 0, 3, 4, 1, 1, 3, 0, 2, 0] in this case). What I tried is:

A_pos = {"A": [sum(int(i[0] == "A") for i in sequences), sum(int(i[1] == "A") for i in sequences), sum(int(i[2] == "A") for i in sequences),

and so on for each position in the sequences. I'm trying to make it check all the positions at once instead of writing out each position manually.

The code you posted is only partial, but it iterates over sequences once per index. You can count everything in a single pass with zip. Every character still has to be read once, so this only changes the order in which they are read:

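A = []
# zip(*sequences) yields one tuple per position, holding that position's character from every sequence
for s in zip(*sequences):
    print(s)
    num_a = 0
    for nuc in s:
        if nuc == "A":
            num_a += 1
    A.append(num_a)
print(A)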

The successive values of s are:

('G', 'T', 'C', 'A', 'T', 'T')
('A', 'C', 'A', 'C', 'A', 'A')
('G', 'C', 'G', 'A', 'G', 'G')

And so on: each tuple holds the characters found at one position across all the sequences. The result is:

[1, 4, 1, 0, 0, 3, 4, 1, 1, 3, 0, 2, 0]

If the sequences are not all the same length, zip stops at the shortest one. In that case you can use itertools.zip_longest, which pads the shorter sequences with a fill value of your choice (the fillvalue argument).
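As a minimal sketch of that padding idea: the function name count_per_position, the "-" fill character, and the ragged example sequences below are made up for illustration and are not from the question.

```python
from itertools import zip_longest

def count_per_position(sequences, target="A", fill="-"):
    """Count how often `target` occurs at each position.

    Shorter sequences are padded with `fill` (an assumed placeholder that
    must not be a real nucleotide), so no position is silently dropped.
    """
    counts = []
    for column in zip_longest(*sequences, fillvalue=fill):
        # column is a tuple with one character per sequence
        counts.append(column.count(target))
    return counts

# Hypothetical sequences of unequal length
ragged = ["GAGGTA", "TCA", "AAGGTTGA"]
print(count_per_position(ragged))  # [1, 2, 1, 0, 0, 1, 0, 1]
```

The result has as many entries as the longest sequence. With plain zip it would stop after the third position, because the shortest sequence has only three characters.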

Cheers!

You're close, but you should loop over the index rather than writing out each lookup by hand:

[sum(x[i] == "A" for x in sequences) for i in range(len(sequences[0]))]

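This visits each index in turn and, for that index, adds one for every sequence that has an "A" there (True counts as 1 when summed). It assumes all sequences are at least as long as the first one.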
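For completeness, here is the same comprehension as a runnable snippet that builds the dictionary from the question. The intermediate name a_counts is only for illustration.

```python
sequences = ["GAGGTAAACTCTG", "TCCGTAAGTTTTC", "CAGGTTGGAACTC",
             "ACAGTCAGTTCAC", "TAGGTCATTACAG", "TAGGTACTGATGC"]

# seq[i] == "A" is a bool; sum() treats True as 1 and False as 0,
# so the int(...) call from the question is not needed.
a_counts = [sum(seq[i] == "A" for seq in sequences)
            for i in range(len(sequences[0]))]

A_pos = {"A": a_counts}
print(A_pos)  # {'A': [1, 4, 1, 0, 0, 3, 4, 1, 1, 3, 0, 2, 0]}
```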

result = {'A': 13*[0], 'G': 13*[0], 'T': 13*[0], 'C': 13*[0]}
for index, sequence in enumerate(zip(*sequences)):
    for nucleotide in sequence:
        result[nucleotide][index] += 1

Output:

{'A': [1, 4, 1, 0, 0, 3, 4, 1, 1, 3, 0, 2, 0], 'G': [1, 0, 4, 6, 0, 0, 1, 3, 1, 0, 0, 1, 2], 'T': [3, 0, 0, 0, 6, 1, 0, 2, 3, 3, 2, 3, 0], 'C': [1, 2, 1, 0, 0, 2, 1, 0, 1, 0, 4, 0, 4]}
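The loop above hard-codes the length 13 and would raise a KeyError on any character other than A, G, T, or C. One possible variation, sketched here with collections.Counter (my substitution, not part of the answer above), avoids both; the name column_counts is illustrative.

```python
from collections import Counter

sequences = ["GAGGTAAACTCTG", "TCCGTAAGTTTTC", "CAGGTTGGAACTC",
             "ACAGTCAGTTCAC", "TAGGTCATTACAG", "TAGGTACTGATGC"]

# One Counter per position: each maps a character to its count in that column
column_counts = [Counter(column) for column in zip(*sequences)]

# A Counter returns 0 for missing keys, so absent nucleotides need no special case
result = {nuc: [counts[nuc] for counts in column_counts] for nuc in "AGTC"}

print(result["A"])  # [1, 4, 1, 0, 0, 3, 4, 1, 1, 3, 0, 2, 0]
```

Unexpected characters (for example an ambiguity code such as "N") are still counted in column_counts but are simply left out of result.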
