
How to write the lists of lists generated by this for loop to a file (Python)?

I am having trouble writing the output lists from my one-hot encoder to a file. Here is the code showing how these lists are generated. The input files contain several DNA sequences, which look like this, for example:

>seq1
AGTAGATAG
>seq2
GGTTAACCG

This is the Python code:

import sys

path = sys.argv[1]
file = open(path, 'r')

holder = file.read()
holder1 = str(holder)
holder2 = holder1.replace("\n","")
uppercase = holder2.upper()

import re
test = re.sub('1|2|3|4|5|6|7|8|9|0|\t|SEQ|CHR|-|:', "", uppercase)
newone = test.split(">")
newone = [x for x in newone if x]

#checking for presence of N in sequence
lettern = "N"
result = [component for component in lettern if(component in newone)]

#if N is present in sequence, an error message is displayed

for line in newone:

    if (bool(result)) == True:
        print("The input sequence is invalid, N is present.")
        sys.exit()

#if sequence is in the correct format, proceed with one hot encoding

    else:   
     #mapping of bases to integers as a dictionary
        bases = "ATCG"
        base_to_integer = dict((i, c) for c, i in enumerate(bases))

    #encoding input sequence as integers

        integer_encoded = [base_to_integer[y] for y in line]      

    #one hot encoding
        onehot_encoded = list()
        for value in integer_encoded:
            base = [0 for x in range(len(bases))]
            base[value] = 1
            onehot_encoded.append(base)
        print(onehot_encoded)

I have tried amending the for loop at the end in many different ways, but I still cannot get it to write the whole output to one file; the file usually ends up containing only the last encoded sequence. This is the closest I got to a solution:

        onehot_encoded = list()
        temporal = list()
        for value in integer_encoded:
            base = [0 for x in range(len(bases))]
            base[value] = 1
            onehot_encoded.append(base)
        temporal.extend(onehot_encoded)

        with open("output", "a") as file:
            file.write(str(temporal))
        file.close()

However, this ends up repeating the loop, and when I view the output file on Linux after running it, a very strange-looking jumble of my username and server name also appears.

I would really appreciate any help with getting this whole output into one file.

Your problem seems to be that you reset the output structure inside the loop, so when you write to the file only the last encoding is available.

I say "seems" because your code is quite complicated; in particular, you keep recomputing things that could be moved outside the loop.
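To illustrate the point about resetting inside the loop, here is a minimal sketch. The list names and the two short strings are hypothetical stand-ins for the encoded sequences, not part of the original script.

```python
# Hypothetical stand-ins for the encoded sequences (illustrative only).
sequences = ["AG", "TC"]

# Resetting inside the loop: the list is re-created on every iteration,
# so only the last item survives once the loop has finished.
for seq in sequences:
    collected_inside = []
    collected_inside.append(seq)

# Creating the list once, before the loop: every item is kept.
collected_outside = []
for seq in sequences:
    collected_outside.append(seq)

print(collected_inside)   # ['TC']
print(collected_outside)  # ['AG', 'TC']
```

The same reasoning applies to opening the output file: anything that should persist across iterations is set up once, before the loop starts.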

To fix your program: ① put the stop condition outside the loop, ② open the output file before starting the loop, ③ use a dictionary to precompute the encodings for the different bases, ④ simplify the loop, because a dictionary lookup replaces recomputing the encoding every time, and ⑤ print to the output file using the keyword argument file=...

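# ① check for the stop letter once, before the loop; test each sequence,
#    because `"N" in newone` only matches a list element equal to "N"
stop_letters = "N"
for stop_letter in stop_letters:
    if any(stop_letter in seq for seq in newone):
        sys.exit("The input sequence is invalid, N is present.")

# ② open the output file once, before the loop
out = open(..., 'w')
# ③ precomputed one-hot encoding for each base
d = {"A":[1,0,0,0], "T":[0,1,0,0], "C":[0,0,1,0], "G":[0,0,0,1]}

# ④ one dictionary lookup per base, ⑤ one line per sequence in the file
for bases in newone:
    onehot_encoded = [d[base] for base in bases]
    print(onehot_encoded, file=out)

out.close()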
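For reference, the same idea can be put together as a self-contained sketch. The helper names `parse_fasta` and `write_one_hot` are hypothetical, and the headers are dropped by reading the input line by line rather than with the regular expression from the question, so treat this as one possible arrangement under those assumptions rather than the only way to do it. An in-memory `io.StringIO` buffer stands in for the output file so the example runs without touching the disk.

```python
import io

# Lookup table in the same A, T, C, G order used above.
ONE_HOT = {"A": [1, 0, 0, 0], "T": [0, 1, 0, 0],
           "C": [0, 0, 1, 0], "G": [0, 0, 0, 1]}

def parse_fasta(text):
    """Return the sequences found in FASTA-style text, upper-cased, headers dropped."""
    sequences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A header starts a new record; store the previous one, if any.
            if current:
                sequences.append("".join(current))
            current = []
        elif line:
            current.append(line.upper())
    if current:
        sequences.append("".join(current))
    return sequences

def write_one_hot(sequences, out):
    """Write one encoded sequence per line to the open file-like object `out`."""
    # Validate everything first, so nothing is written for invalid input.
    for seq in sequences:
        if any(base not in ONE_HOT for base in seq):
            raise ValueError(f"invalid base in sequence: {seq}")
    for seq in sequences:
        print([ONE_HOT[base] for base in seq], file=out)

example = ">seq1\nAGTAGATAG\n>seq2\nGGTTAACCG\n"
buffer = io.StringIO()  # stand-in for a real file opened once with mode "w"
write_one_hot(parse_fasta(example), buffer)
lines = buffer.getvalue().splitlines()
print(len(lines))  # 2, one line per input sequence
```

With a real file, the buffer would be replaced by something like `with open("output.txt", "w") as out:` (the file name is a placeholder). Because `print` ends each line with a newline, the shell prompt should no longer run into the end of the file when it is displayed, which is a likely explanation for the jumble of username and server name mentioned in the question.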
