CS50 Problem Set 6, IndexError: list index out of range

Question

I don't know what is wrong here, but an error keeps popping up when I use a large database. For example:

dna/ $ python dna.py databases/large.csv sequences/10.txt
Traceback (most recent call last):
  File "/workspaces/103840690/dna/dna.py", line 104, in <module>
    main()
  File "/workspaces/103840690/dna/dna.py", line 47, in main
    check[i][j] = False
IndexError: list index out of range

I know this type of error means that I am trying to access an index that doesn't exist, but nothing I try seems to work. It is also weird that I only get it when using the large database.

The problem is probably in lines 40-49, under the comment "Check database for matching profiles". I pasted the whole code for context.

import csv
import sys


def main():

    # Check for command-line usage
    if len(sys.argv) != 3:
        print("Two command-line arguments needed. ")
        return 1


    # Read database file into a variable
    with open(sys.argv[1], "r") as csv_file:
        csv_database = csv.DictReader(csv_file)

        # create a list where we can put dictionaries
        database = []
        for lines in csv_database:
            database.append(lines)

        # create a keys list where we can put STRs
        STRs = []
        for key in database[0].keys():
            STRs.append(key)
        STRs.remove("name")


    # Read DNA sequence file into a variable
    with open(sys.argv[2], "r") as txt_file:
        sequence = txt_file.read()


    # Find longest match of each STR in DNA sequence
    matches = {}
    for i in range(len(STRs)):
        matches[STRs[i]] = longest_match(sequence, STRs[i])

    # Check database for matching profiles
    check = [[0]*len(database)]*len(STRs)
    match = None
    for i in range(len(database)):
        for j in range(len(STRs)):
            if matches[STRs[j]] == int(database[i][STRs[j]]):
                check[i][j] = True
            else:
                check[i][j] = False
        if False not in check[i]:
            match = i

    if match != None:
        print(database[match]["name"])
    else:
        print("No match")

    return


def longest_match(sequence, subsequence):
    """Returns length of longest run of subsequence in sequence."""

    # Initialize variables
    longest_run = 0
    subsequence_length = len(subsequence)
    sequence_length = len(sequence)

    # Check each character in sequence for most consecutive runs of subsequence
    for i in range(sequence_length):

        # Initialize count of consecutive runs
        count = 0

        # Check for a subsequence match in a "substring" (a subset of characters) within sequence
        # If a match, move substring to next potential match in sequence
        # Continue moving substring and checking for matches until out of consecutive matches
        while True:

            # Adjust substring start and end
            start = i + count * subsequence_length
            end = start + subsequence_length

            # If there is a match in the substring
            if sequence[start:end] == subsequence:
                count += 1

            # If there is no match in the substring
            else:
                break

        # Update most consecutive matches found
        longest_run = max(longest_run, count)

    # After checking for runs at each character in sequence, return longest run found
    return longest_run


main()

Answer 1

Your indices are in the wrong order. check is a list of len(STRs) elements, each of which is a list with len(database) elements.

    # Check database for matching profiles
    check = [[0]*len(database)]*len(STRs)
    match = None
    for i in range(len(database)):
        for j in range(len(STRs)):
            if matches[STRs[j]] == int(database[i][STRs[j]]):
                check[i][j] = True
            else:
                check[i][j] = False
        if False not in check[i]:
            match = i

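You are iterating over the database with the variable i and over the STRs with the variable j. To match the initialization of check, the result should be stored in check[j][i]. Alternatively, swap the two lengths in the initialization so that check has one row per database entry, which also keeps the later check[i] lookup within bounds.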
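
As a minimal sketch of the same comparison without the two-dimensional check list, each database row can be tested in one pass with all(). The helper name find_match and the sample data below are made up for illustration and are not part of the original program. Note that this version returns the first matching row, whereas the original loop keeps the last one.

```python
def find_match(database, matches):
    """Return the name of the first row whose STR counts all equal matches, else None."""
    for row in database:
        # csv.DictReader yields strings, so convert each count before comparing
        if all(int(row[str_name]) == count for str_name, count in matches.items()):
            return row["name"]
    return None


# Hypothetical sample data shaped like csv.DictReader rows
database = [
    {"name": "Alice", "AGATC": "2", "AATG": "8"},
    {"name": "Bob", "AGATC": "4", "AATG": "1"},
]
matches = {"AGATC": 4, "AATG": 1}

print(find_match(database, matches) or "No match")
```

Because no index arithmetic is involved, the shape of the database no longer matters.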

Answer 2

When you multiply a nested list, the whole inner list is repeated, not its individual elements. See this example:

a = [[0]*2]*5
print(a)
> [[0, 0], [0, 0], [0, 0], [0, 0], [0, 0]]
print(a[4][1])
> 0
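
One more caveat with this pattern, shown as a small sketch (the variable names are only for illustration): the outer multiplication repeats references to the same inner list, so writing to one row changes every row. A list comprehension builds independent rows instead.

```python
# Outer multiplication copies the reference, not the inner list
shared = [[0] * 2] * 3
shared[0][0] = 1
print(shared)        # [[1, 0], [1, 0], [1, 0]]

# A comprehension creates a new inner list for every row
independent = [[0] * 2 for _ in range(3)]
independent[0][0] = 1
print(independent)   # [[1, 0], [0, 0], [0, 0]]
```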

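In your code, check = [[0]*len(database)]*len(STRs) creates an outer list of length len(STRs) whose entries each have length len(database), but your loops index it as check[i][j] with i running over the database and j over the STRs. Whenever the two lengths differ, one of the indices runs past the end. Build check with one row per database entry instead, and your loops can stay as they are:

check = [[False] * len(STRs) for _ in range(len(database))]
match = None
for i in range(len(database)):
    for j in range(len(STRs)):
        # True when the sequence's count for this STR equals this person's count
        check[i][j] = matches[STRs[j]] == int(database[i][STRs[j]])
    if False not in check[i]:
        match = i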

solution1
2 2022-06-29 09:35:27

solution2
1 ACCEPTED 2022-06-29 09:42:27
