
Python reading a complicated .txt file

I have a .txt file with data like this:

Header:ensembl gene ID|Ensembl Transcript ID|CDS start|CDS end|5'UTR start|5'UTR end|3'UTR start|3'UTR end|Transcripts start|Transcripts end
>ENSMUSG00000002477|ENSMUST00000002551|*some junk information*...etc.|
TCGCGCGTCCGCAGGCCTCCGCGCGCTTTTCCG....etc.
>ENSMUSG00000002835|ENSMUST00000002914|...etc.|
GCAGAAGTGACACCGGTGGGAGGCG...etc.

I have written code that gets me as far as having the names (ENSMUSG0000000xxxx).

I want to find those names in the .txt and read the line that follows each one in triplets, e.g. "TACGTACG" read as "TAC", "GTA".

Then I want to do the same thing, but starting at the 2nd letter instead of the 1st; with the above example it would read "ACG" and "TAC".

And then the same thing again, but skipping the first 2 letters.

I really don't know how to do this, especially the part about reading 3 letters at a time. Can someone give me a hand, please?

This is the code I have so far:

import csv
import itertools
import os.path

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    # zip_longest is the Python 3 name; on Python 2 use itertools.izip_longest
    return itertools.zip_longest(*args, fillvalue=fillvalue)

# open the CSV file and read it row by row
with open("C:/Users/Ivan Wong/Desktop/Placement/Lists of targets/Mouse/UCSC to Ensembl.csv", "r") as f:
    reader = csv.reader(f, delimiter=',')
    # find files with the name in the 1st column
    for row in reader:
        graph_filename = os.path.join("C:/Users/Ivan Wong/Desktop/Placement/Interesting reading/3'ORF", row[0] + "_nt_counts.txt.png")
        if os.path.exists(graph_filename):
            y = row[0] + '_nt_counts.txt'
            with open('C:/Users/Ivan Wong/Desktop/Placement/fp_mesc_nochx/' + y, 'r') as r:
                k = r.readlines()
            del k[:1]  # drop the first line
            integers = [int(s.strip()) for s in k]
            # add up the numbers for every 3 rows
            result = [sum(group) for group in grouper(3, integers, 0)]
            e = row[1]

with open('C:/Users/Ivan Wong/Desktop/Placement/Downloaded seq/Mouse/cDNA.txt', 'r') as cDNA:
    q = cDNA.readlines()
# delete the 1st line, which I do not want at all
del q[:1]

Now I have an idea, which I want to break down into steps:

1st: find the names (which I called e) in the list read from my .txt (which I called q)

2nd: read the lines that follow a name until another name is reached

3rd: join the lines I read into a single sequence of letters, like "A", "T", "C", "G", "A", "A", etc.

4th: read the letters 3 at a time, giving "ATC", "GAA"

5th: write them to a file, then go back to the 4th step, but this time start at the 2nd letter

6th: the same as the 5th step, but start at the 3rd letter this time

Although I have this idea, I do not have the programming knowledge to carry it out. Can someone please help me?

Since this is not homework, here's a way to get started. Assuming the lines you are interested in are those that don't start with '>', slicing will help here.

with open('data.txt') as inf:
    for line in inf:
        if not line.startswith('>'):
            # every 3-character window on the line, moving one character at a time
            strings3 = [line[i:i+3] for i in range(len(line))]

This collects every 3-letter window on each line, moving one character at a time:

Input line:

GCAGAAGTGACACCGGTGGGAGGCG

Output:

['GCA', 'CAG', 'AGA', 'GAA', 'AAG', 'AGT', 'GTG', 'TGA', 'GAC', 'ACA', 'CAC', 'ACC', 'CCG', 'CGG', 'GGT', 'GTG', 'TGG', 'GGG', 'GGA', 'GAG', 'AGG', 'GGC', 'GCG', 'CG\n', 'G\n', '\n']

Note that the window moves one character at a time, so the triplets overlap; the last few entries are shorter than 3 letters and include the trailing newline.
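
To get the non-overlapping triplets the question asks for, one option is to strip the newline and step through the line 3 characters at a time, starting at offset 0, 1 or 2. This is only a minimal sketch: the function name triplets and the choice to drop an incomplete trailing group are assumptions, not part of the code above.

```python
def triplets(seq, offset=0):
    """Split seq into non-overlapping 3-letter groups, starting at offset."""
    seq = seq.strip()  # remove the trailing newline read from the file
    # stop early enough that every slice has exactly 3 letters
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


line = "TACGTACG\n"
for offset in range(3):
    print(offset, triplets(line, offset))
# 0 ['TAC', 'GTA']
# 1 ['ACG', 'TAC']
# 2 ['CGT', 'ACG']
```

If the leftover letters at the end matter, the range could run to len(seq) instead, at the cost of a shorter last group.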

You might also be able to reuse the grouper function from the other question you posted recently.
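
As a rough sketch of how grouper could fit the steps in the question: collect the sequence lines that follow a wanted '>' header, join them into one string, and run each of the three offsets through grouper. The names read_sequences, frames and wanted are hypothetical, the sample lines are made up for illustration, and the sketch assumes Python 3 (itertools.zip_longest) and that the gene ID is the first '|'-separated field of a header line, as in the sample data.

```python
import itertools


def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)


def read_sequences(lines, wanted):
    """Return {name: sequence} for headers whose first field is in wanted."""
    sequences = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            # assumed layout: '>' + gene ID, then '|'-separated fields
            first = line[1:].split('|')[0]
            name = first if first in wanted else None
            if name is not None:
                sequences[name] = ''
        elif name is not None:
            # a sequence may span several lines; join them into one string
            sequences[name] += line
    return sequences


def frames(seq):
    """Return the triplets of seq read from offsets 0, 1 and 2."""
    result = []
    for offset in range(3):
        groups = (''.join(g) for g in grouper(3, seq[offset:], ''))
        # keep complete triplets only; a padded last group is dropped
        result.append([g for g in groups if len(g) == 3])
    return result


# made-up sample lines in the same layout as the data in the question
sample = [
    ">ENSMUSG00000002477|ENSMUST00000002551|\n",
    "TACG\n",
    "TACG\n",
    ">ENSMUSG00000002835|ENSMUST00000002914|\n",
    "GCAGAAG\n",
]
found = read_sequences(sample, {"ENSMUSG00000002477"})
for name, seq in found.items():
    for offset, codons in enumerate(frames(seq)):
        print(name, offset, ' '.join(codons))
```

To write the result to a file instead, the print call could be replaced by a write to a file opened with open(..., 'w'). If the same gene ID appears in several headers, this sketch keeps only the last sequence.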
