Python code takes too long to run on the DNA problem from CS50x

Question

I have written code to solve the DNA problem from week 6 of CS50. However, when I run it on the large.csv database and its sequences, it takes at least one minute to produce output. On small.csv it produces output instantly. Because of this, I can't pass check50. I suspect the problem is in the function that computes the maximum number of consecutive repeats of each STR, but I don't know how to write it more efficiently. The full description of the problem is here: https://cs50.harvard.edu/x/2021/psets/6/dna/#:~:text=check50%20cs50/problems/2021/x/dna

And here are the source files for the databases and sequences: https://cdn.cs50.net/2019/fall/psets/6/dna/

This is my code:

import csv
import sys


def main():
    
    # check a proper input
    if len(sys.argv) != 3:
        sys.exit("Usage: python dna.py data.csv sequence.txt")
    
    # create a list for all data
    data_all = []
    
    # create a list for all STRs
    STR_all = []
    
    # write data to list
    with(open(sys.argv[1])) as data:
        reader = csv.DictReader(data)
        for row in reader:
            row["name"]
            data_all.append(row)
            
    # write header to a list 
    with(open(sys.argv[1])) as data:      
        reader = csv.reader(data)
        headings = next(reader)
        STR_all.append(headings)
    
    # delete "name" from header, it is on the first position    
    STR_all = STR_all[0]
    STR_all.pop(0)
            
    # create a string with DNA sequence
    with(open(sys.argv[2])) as seq:
        line = seq.read()
    
    # create a list with max number of repeating STR from a line(DNA)
    max_seq = []
    
    # enter data with string of STR and it's max repeating time    
    for i in range(len(STR_all)):
        result = f"{compare(STR_all[i], line)}"
        max_seq.append(result)
        
    # create a dictionary with a list of all STRs and according number of repeating sequences
    STR_with_max_seq = dict(zip(STR_all, max_seq))
    
    # compare values from data_all and STR_with_max_seq
    for i in range(len(data_all)):
        # delete name key and store key in variable "name"
        name = data_all[i].pop('name')
        if data_all[i] == STR_with_max_seq:
            print(name)
            sys.exit()
            break
        else:
            continue
        
    # Print if no match found
    print("No match")
        
    # variables that I used to check on different stages of writing a program
            
    # print(data_all)
    # print(line)
    # print(STR_all)
    # print(max_seq)
    # print(STR_with_max_seq)
    
    # print(len(data_all))
    # print(name)

    
def compare(STR, DNA):

    for key in DNA:
        l = len(STR)
        tmp_max = 0
        tmp = 0
        
        # iteration through the whole length of DNA
        for i in range(len(DNA)):
            if tmp > 0:
                tmp = 0
            
            # enters if sequences are equal
            if DNA[i: i + l] == STR:
                tmp += 1
                # increments tmp if its sequence repeats
                while DNA[i - l: i] == DNA[i: i + l]:
                    tmp += 1
                    i += l
                # update the max found number of repeating sequences    
                if tmp > tmp_max:
                    tmp_max = tmp
    
    return tmp_max

    
main()

UPDATE: I used time.monotonic() to measure the total execution time of main(). This is the time for small.csv:

Value of the monotonic clock (in fractional seconds): 661689.405232647
Time elapsed during the process: 0.02439890895038843

And this is for large.csv:

Value of the monotonic clock (in fractional seconds): 661943.13288005
Time elapsed during the process: 108.33000503003132

Answer 1

I know that problem. Several parts of your code make it slow.

First, let's read each file only once. For instance:

with open(sys.argv[1]) as data:
    reader = csv.DictReader(data)
    # the header row is available directly, so no second pass over the file is needed
    STR_all = reader.fieldnames
    for row in reader:
        data_all.append(row)

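STR_all will then already be a flat list of column names (it still starts with "name", so keep the STR_all.pop(0) call). That means you can drop the second with block that re-reads the header, along with this line: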

STR_all = STR_all[0]
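
To make the single-pass idea concrete, here is a minimal sketch. The helper name `load_database` and the inlined two-person sample are assumptions made for illustration; they are not part of the original program or of the CS50 distribution code.

```python
import csv
import io


def load_database(csv_file):
    """Read a CS50-style DNA database from an open text file in one pass."""
    reader = csv.DictReader(csv_file)
    # fieldnames is the header row; every column except "name" is an STR
    strs = [field for field in reader.fieldnames if field != "name"]
    # each row is a dict mapping column name -> string value
    people = list(reader)
    return strs, people


# Hypothetical sample data, inlined so the example runs without any files
sample = io.StringIO("name,AGATC,AATG\nAlice,2,8\nBob,4,1\n")
strs, people = load_database(sample)
print(strs)
print(people[0]["name"], people[0]["AGATC"])
```

In the real program the argument would be the file object returned by open(sys.argv[1]) rather than a StringIO.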

You can build the dictionary of counts while you compute them, which avoids iterating twice.

For example, by doing this:

    # map each STR to its longest run, stored as a string to match the CSV values
    STR_with_max_seq = {}
    for key in STR_all:
        STR_with_max_seq[key] = str(compare(key, line))

You can remove this:

    # create a dictionary with a list of all STRs and according number of repeating sequences
    STR_with_max_seq = dict(zip(STR_all, max_seq))
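
The final matching step can also be written without popping "name" from each row or converting the counts to strings. The following is only a sketch under assumptions: `find_match` is a hypothetical helper, `people` is a list of row dicts as produced by csv.DictReader, and `profile` maps each STR to an integer count.

```python
def find_match(people, profile):
    """Return the name whose STR counts all equal the profile, or 'No match'."""
    for person in people:
        # CSV values are strings, so convert before comparing with the int counts
        if all(int(person[s]) == count for s, count in profile.items()):
            return person["name"]
    return "No match"


# Hypothetical data in the shape csv.DictReader would produce
people = [
    {"name": "Alice", "AGATC": "2", "AATG": "8"},
    {"name": "Bob", "AGATC": "4", "AATG": "1"},
]
print(find_match(people, {"AGATC": 4, "AATG": 1}))
print(find_match(people, {"AGATC": 9, "AATG": 9}))
```

Returning the result instead of calling sys.exit() inside the loop keeps the function easy to test and leaves the printing to main().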

Finally, to improve the compare function, you can drop the outer loop. "for key in DNA" never uses key, so it just repeats the entire scan once for every character in DNA, which is why large inputs take so long. What you want is the maximum number of consecutive occurrences of STR inside DNA, so you only need to walk through DNA once, comparing windows of the length of STR. For example:

def compare(STR, DNA):
    l = len(STR)
    tmp_max = 0
    tmp = 0
    i = 0

    # iterate over DNA; "<=" so the final window of length l is checked too
    while i <= len(DNA) - l:
        SSTR = DNA[i : i + l]  # extract a substring of length l

        if SSTR == STR:
            # a repeat was found, so jump l positions to the next window
            i += l
            tmp += 1
        else:
            i += 1
            if tmp > tmp_max:
                tmp_max = tmp
            tmp = 0

    # a run that reaches the end of DNA is never closed by a mismatch
    return max(tmp_max, tmp)
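
As an alternative to the jump-ahead scan, the same count can be computed by walking DNA from right to left and recording, for each position, how many back-to-back copies of STR start there. This is a different technique from the one above, shown only as a sketch; `longest_run` is a hypothetical name. One reason to consider it is that it also handles STRs that can overlap themselves, where jumping l positions after a match can skip the start of a longer run.

```python
def longest_run(STR, DNA):
    """Return the largest number of consecutive copies of STR found in DNA."""
    l = len(STR)
    if l == 0:
        return 0
    # runs[i] = number of back-to-back copies of STR starting at index i
    runs = [0] * (len(DNA) + 1)
    best = 0
    # go right to left so runs[i + l] is already known when we reach i
    for i in range(len(DNA) - l, -1, -1):
        if DNA.startswith(STR, i):
            runs[i] = runs[i + l] + 1
            best = max(best, runs[i])
    return best


print(longest_run("AGATC", "AGATCAGATCTTAGATC"))
print(longest_run("ABA", "ABABAABA"))
```

Each position is examined once, so the work grows with len(DNA) times len(STR) rather than with the square of len(DNA).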

Answer 1: accepted, 2021-01-24 16:12:48