Counting repeated STRs in DNA (CS50 PSET6)

I'm currently working on CS50. I tried to count the STRs in the DNA sequence file, but my code always overcounts.

For example: how many times does 'AGATC' repeat consecutively in the DNA file?

This code is only an attempt to work out how to count those repeated STRs accurately.

import csv
import re
from sys import argv, exit

def main():
    if len(argv) != 3:
        print("Usage: python dna.py data.csv sequence.txt")
        exit(1)

    with open(argv[1]) as csv_file, open(argv[2]) as dna_file:
        reader = csv.reader(csv_file)
        #for row in reader:
        #    print(row)

        str_sequences = next(reader)[1:]

        dna = dna_file.read()
        for i in range(len(dna)):
            count = len(re.findall(str_sequences[0], dna))   # str_sequences[0] is 'AGATC'
        print(count)

main()

Result for DNA file 11 (AGATC):

$ python dna.py databases/large.csv sequences/11.txt
52

The result is supposed to be 43. For small.csv it counts accurately, but for large.csv it always overcounts. I later realized that my code counts every match of AGATC in the DNA file. But the task is to count only the copies that repeat consecutively, ignoring the same STR if it shows up again later.

{AGATCAGATCAGATCAGATC(T)TTTTAGATC}

So, how do I stop counting once the DNA hits the (T), so that the AGATC that comes after it isn't counted? What should I change in my code, especially in the re.findall() call? Some people said to use substrings; how would I do that? Or can I just use a regex like I did?

Please post your code if you can. Sorry for my bad English.

The for loop is wrong: every iteration runs the same re.findall() over the whole string, and that call counts every occurrence of the sequence, including ones that sit outside the consecutive run. I think you want to loop over str_sequences instead.

Something like:就像是:

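seq_list = []

for STR in str_sequences:
    # Each match is one maximal run of back-to-back copies of STR
    groups = re.findall(rf'(?:{STR})+', dna)
    if len(groups) == 0:
        seq_list.append('0')
    else:
        # Longest run, in repeats; kept as a string, like the values csv.reader yields
        seq_list.append(str(max(map(lambda x: len(x)//len(STR), groups))))

print(seq_list)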
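
To see the difference on the example from the question, here is a minimal self-contained sketch. The function names longest_run_regex and longest_run_scan and the sample string are made up for illustration; the second function is one possible reading of the "substring" approach the question mentions, using only slicing.

```python
import re


def longest_run_regex(dna: str, unit: str) -> int:
    """Longest number of back-to-back copies of `unit` in `dna`, via regex."""
    # (?:...)+ matches one or more consecutive copies of the unit;
    # re.escape guards against regex metacharacters in the unit.
    runs = re.findall(rf"(?:{re.escape(unit)})+", dna)
    return max((len(run) // len(unit) for run in runs), default=0)


def longest_run_scan(dna: str, unit: str) -> int:
    """Longest run of `unit`, using only slicing (assumes `unit` is non-empty)."""
    step = len(unit)
    best = 0
    for start in range(len(dna)):
        count = 0
        # Move forward one unit at a time while the next slice still matches.
        while dna[start + count * step: start + (count + 1) * step] == unit:
            count += 1
        best = max(best, count)
    return best


# The example from the question: four consecutive copies, then a stray one.
sample = "AGATCAGATCAGATCAGATCTTTTTAGATC"
total_matches = len(re.findall("AGATC", sample))  # counts every occurrence

print(total_matches, longest_run_regex(sample, "AGATC"), longest_run_scan(sample, "AGATC"))
# prints: 5 4 4
```

Counting every match gives 5, while both run-based versions give 4. One caveat: re.findall() scans non-overlapping matches from left to right, so for a unit that can overlap with itself (a hypothetical 'ABA', say) the regex version can miss a run, whereas the slicing version checks every start position.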

Also, there are many posts on this problem. You may want to look at some of them to finish your program.
