Why is looping through a file line by line printing too many results?

Question

I'm looping through the lines of a file to build a list of dicts holding start/stop positions, but I'm getting far too many results and I'm not sure why. It looks as if every ref_start/ref_end pair is being added to the list multiple times.

def main():

    #initialize variables for counts
    gb_count = 0
    glimmer_count = 0
    exact_count = 0
    five_prime_count = 0 
    three_prime_count = 0
    no_matches_count = 0

    #protein_id list
    protein_id = []

    #initialize lists for start/stop coordinates
    reference = []
    prediction = []

    #read in GenBank file
    for line in open('file'):

        line = line.rstrip()

        if "protein_id=" in line:
            pro_id = line.split("=")
            pro_id = pro_id[1].replace('"','')
            protein_id.append(pro_id)

        elif "CDS" in line:
            if "join" in line:
                continue

            elif "/translation" in line:
                continue

            elif "P" in line:
                continue

            elif "complement" in line:
                value = " ".join(line.split()).replace('CDS','').replace("(",'').replace(")",'').split("complement")
                newValue = value[1].split("..")
                ref_start = newValue[1]
                ref_end = newValue[0]
                gb_count += 1


            else:
                test = " ".join(line.split()).replace('CDS','').split("..")
                ref_start = test[0]
                ref_end = test[1]
                gb_count += 1
            reference.append({'refstart': ref_start, 'refend': ref_end})
            print(reference)

Answer 1

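I initially posted something else that was wrong, but I copied your code over, ran it on a dummy file, and I think I've figured it out. The problem is not for line in open('file'): iterating over a file object yields one line at a time, so the loop header is fine.

You would only see 'line' = "p", then 'line' = "r", etc. if you iterated over a string, for example the result of open('file').read(). A file object itself is never walked character by character.

What produces the extra output, in the code as posted, is print(reference). It sits inside the loop, so every time a CDS line is appended the entire list built so far is printed again. Each ref_start/ref_end pair is stored only once, but it shows up in every later printout, which makes it look as though it was added many times. The fix is simple: move print(reference) out of the loop, or print only the entry you just appended.

Note that rewriting the loop header as

file = open('file')
for line in file:

behaves exactly the same as the original one-liner; both forms read the file line by line. The only difference is that you keep a name for the file object, which lets you close it afterwards. Hope this helped.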
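
For reference, here is a minimal sketch showing that iterating over a file object yields lines while iterating over a string yields characters, and that collecting the results and printing once after the loop avoids the repeated output. The input is a made-up three-line file, not real GenBank data; SAMPLE, parse_positions and the simplified "CDS start..end" line format are assumptions for illustration only.

```python
import os
import tempfile

# Hypothetical input standing in for the real file from the question.
SAMPLE = "CDS 10..50\nCDS 60..90\nCDS 100..150\n"


def parse_positions(path):
    """Return one {'refstart', 'refend'} dict per simplified CDS line."""
    reference = []
    with open(path) as handle:
        for line in handle:  # a file object yields whole lines
            fields = line.split()
            if len(fields) == 2 and fields[0] == "CDS":
                ref_start, ref_end = fields[1].split("..")
                reference.append({"refstart": ref_start, "refend": ref_end})
    return reference


# Write the sample to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(SAMPLE)
    sample_path = tmp.name

try:
    with open(sample_path) as handle:
        lines = [line.rstrip() for line in handle]  # iterating a file: lines
    with open(sample_path) as handle:
        chars = list(handle.read())  # iterating a string: single characters
    reference = parse_positions(sample_path)
finally:
    os.remove(sample_path)

print(len(lines), "lines versus", len(chars), "characters")
print(reference)  # printed once, after the loop has finished
```

If the print were placed inside the loop instead, these three CDS lines would produce printouts of 1, 2 and then 3 entries, so 6 entries would be displayed even though only 3 are stored.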

solution1
0 2020-03-10 05:04:42
