I get an error when I run my program with the large database, and I can't figure out what is wrong. For example:
dna/ $ python dna.py databases/large.csv sequences/10.txt
Traceback (most recent call last):
  File "/workspaces/103840690/dna/dna.py", line 104, in <module>
    main()
  File "/workspaces/103840690/dna/dna.py", line 47, in main
    check[i][j] = False
IndexError: list index out of range
I know this type of error means that I am trying to access an index that doesn't exist, but nothing I have tried works. It is also weird that I only get it when using the large database.
The problem is probably in lines 40 - 49, under the comment "Check database for matching profiles". I have pasted the whole code for context.
import csv
import sys


def main():

    # Check for command-line usage
    if len(sys.argv) != 3:
        print("Two command-line arguments needed. ")
        return 1

    # Read database file into a variable
    with open(sys.argv[1], "r") as csv_file:
        csv_database = csv.DictReader(csv_file)
        # create a list where we can put dictionaries
        database = []
        for lines in csv_database:
            database.append(lines)

    # create a keys list where we can put STRs
    STRs = []
    for key in database[0].keys():
        STRs.append(key)
    STRs.remove("name")

    # Read DNA sequence file into a variable
    with open(sys.argv[2], "r") as txt_file:
        sequence = txt_file.read()

    # Find longest match of each STR in DNA sequence
    matches = {}
    for i in range(len(STRs)):
        matches[STRs[i]] = longest_match(sequence, STRs[i])

    # Check database for matching profiles
    check = [[0]*len(database)]*len(STRs)
    match = None
    for i in range(len(database)):
        for j in range(len(STRs)):
            if matches[STRs[j]] == int(database[i][STRs[j]]):
                check[i][j] = True
            else:
                check[i][j] = False
        if False not in check[i]:
            match = i

    if match != None:
        print(database[match]["name"])
    else:
        print("No match")

    return


def longest_match(sequence, subsequence):
    """Returns length of longest run of subsequence in sequence."""

    # Initialize variables
    longest_run = 0
    subsequence_length = len(subsequence)
    sequence_length = len(sequence)

    # Check each character in sequence for most consecutive runs of subsequence
    for i in range(sequence_length):

        # Initialize count of consecutive runs
        count = 0

        # Check for a subsequence match in a "substring" (a subset of characters) within sequence
        # If a match, move substring to next potential match in sequence
        # Continue moving substring and checking for matches until out of consecutive matches
        while True:

            # Adjust substring start and end
            start = i + count * subsequence_length
            end = start + subsequence_length

            # If there is a match in the substring
            if sequence[start:end] == subsequence:
                count += 1

            # If there is no match in the substring
            else:
                break

        # Update most consecutive matches found
        longest_run = max(longest_run, count)

    # After checking for runs at each character in seqeuence, return longest run found
    return longest_run


main()
Your indices are in the wrong order. check is a list of len(STRs) elements, each of which is a list with len(database) elements.
    # Check database for matching profiles
    check = [[0]*len(database)]*len(STRs)
    match = None
    for i in range(len(database)):
        for j in range(len(STRs)):
            if matches[STRs[j]] == int(database[i][STRs[j]]):
                check[i][j] = True
            else:
                check[i][j] = False
        if False not in check[i]:
            match = i
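You are iterating over the database rows with the variable i and over the STRs with the variable j. To match the way check is initialized, the result would have to be stored in check[j][i]. Alternatively, build check with one row per database entry and one column per STR, so that check[i][j] and the later test on check[i] are valid as written.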
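This also explains why the error only appears with the large database. The sketch below rebuilds check and the fill loop exactly as in the question, but with made-up sizes: the post does not state the dimensions of either file, so 3 by 3 and 23 by 8 are assumptions chosen for illustration, and build_check, fill and fill_raises are hypothetical helper names. When the number of people equals the number of STRs, the swapped indices happen to stay in range; as soon as the two sizes differ, one of them runs past the end.

```python
def build_check(n_people, n_strs):
    # Same construction as in the question: n_strs rows, each n_people long
    return [[0] * n_people] * n_strs


def fill(check, n_people, n_strs):
    # Same loop as in the question: first index is the person, second the STR
    for i in range(n_people):
        for j in range(n_strs):
            check[i][j] = False


def fill_raises(n_people, n_strs):
    # Report whether the question's loop raises IndexError for these sizes
    try:
        fill(build_check(n_people, n_strs), n_people, n_strs)
    except IndexError:
        return True
    return False


print(fill_raises(3, 3))   # square shape (assumed sizes): False
print(fill_raises(23, 8))  # more people than STRs (assumed sizes): True
```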
When you multiply a nested list like this, the inner multiplier sets the length of each inner list and the outer multiplier sets how many inner lists there are. See this example.
a = [[0]*2]*5
print(a)
> [[0, 0], [0, 0], [0, 0], [0, 0], [0, 0]]
print(a[4][1])
> 0
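One more thing is worth knowing about this construction: the outer multiplication does not copy the inner list, it repeats a reference to the same list object. A short sketch of the effect, using the same a as above plus a hypothetical b built with a list comprehension for comparison:

```python
a = [[0] * 2] * 5
a[0][0] = 1
# Every "row" is the same list object, so the change shows up in all of them
print(a)             # [[1, 0], [1, 0], [1, 0], [1, 0], [1, 0]]
print(a[0] is a[4])  # True

# A list comprehension creates a fresh inner list for each row
b = [[0] * 2 for _ in range(5)]
b[0][0] = 1
print(b)             # [[1, 0], [0, 0], [0, 0], [0, 0], [0, 0]]
print(b[0] is b[4])  # False
```

For a grid like check, where each row is supposed to hold its own results, the comprehension form is the one that behaves as expected.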
Since you are using check = [[0]*len(database)]*len(STRs), the outer list has len(STRs) entries and each inner list has len(database) entries. The first index therefore has to range over the STRs and the second over the database entries. You need to change your loop to something like this:
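    # Build independent rows; [[0]*n]*m would repeat one shared inner list
    check = [[0] * len(database) for _ in range(len(STRs))]
    for i in range(len(STRs)):
        for j in range(len(database)):
            # i selects the STR (outer list), j selects the database row (inner list)
            if matches[STRs[i]] == int(database[j][STRs[i]]):
                check[i][j] = True
            else:
                check[i][j] = False
    # Database row j is a match only if every STR agrees for that row
    for j in range(len(database)):
        if False not in [check[i][j] for i in range(len(STRs))]:
            match = j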
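For comparison, the matching step can also be written without the check grid at all, which sidesteps the index-order question entirely. This is only a minimal sketch, assuming database is a list of dicts as produced by csv.DictReader and matches maps each STR to its longest run; find_match and the sample data are made up for illustration. Note that it returns the first matching row, whereas the loop in the question keeps the last one.

```python
def find_match(database, matches, strs):
    """Return the name of the first row whose STR counts all equal matches, else None."""
    for row in database:
        # csv.DictReader yields strings, so convert before comparing
        if all(int(row[s]) == matches[s] for s in strs):
            return row["name"]
    return None


# Made-up sample data in the shape csv.DictReader yields
database = [
    {"name": "Alice", "AGATC": "2", "AATG": "8"},
    {"name": "Bob", "AGATC": "4", "AATG": "1"},
]
strs = ["AGATC", "AATG"]

print(find_match(database, {"AGATC": 4, "AATG": 1}, strs))  # Bob
print(find_match(database, {"AGATC": 9, "AATG": 9}, strs))  # None
```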