
Python: Extract substring from text file based on character index

So I have a file with a few thousand entries of the following form (FASTA format, if anyone wants to know):

>scaffold1110_len145113_cov91
TAGAAAATTGAATAATTGATAGTTCTTAACGAAAAGTAAAAGTTTAAAGTATACAGAAATTTCAGGCTATTCACTCTTTT
ATAATCCAAAATTAGAAATACCACACCTTGCATAAAGTTTAAGATATTTACAAAAACCTGAAGTGGATAATCCGAAATCG
...
>Next_Header
ATGCTA...

And I have a Python dictionary from another part of my code that contains information like the following for a number of headers:

{'scaffold1110_len145113_cov91': [[38039, 38854, 106259], [40035, 40186, 104927]]}

This maps each header to a list of [start, end, left] triples: the start position, the end position, and the number of characters remaining in that entry after the end position. Positions are 1-based, so start=1 means the first character of the line below the corresponding header.

What I want to do is extract the string for this interval, including 25 (or a variable number of) characters before and after it if the entry allows for that; otherwise, include all characters up to the beginning or end. (For example, when the start position is 8, I can't include 25 characters in front, only the 7 that precede it.)
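The window arithmetic described here can be sketched as a small helper. This is only an illustration under stated assumptions: `window_bounds`, `flank`, and `seq_len` are hypothetical names, and `seq_len` is taken to be the full length of the entry (with the dictionary above, that would be `end + left`).

```python
def window_bounds(start, end, seq_len, flank=25):
    """Return (lo, hi) so that seq[lo:hi] covers the 1-based inclusive
    interval [start, end] plus up to `flank` characters on each side."""
    lo = max(start - 1 - flank, 0)  # 1-based start -> 0-based index, clamped at 0
    hi = min(end + flank, seq_len)  # slice end is exclusive, so `end` needs no -1
    return lo, hi


seq = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lo, hi = window_bounds(8, 10, len(seq), flank=3)
print(seq[lo:hi])  # EFGHIJKLM: "HIJ" plus three characters on each side
```

With `flank=25` on the same 26-character string, both ends are clamped and the slice covers the whole string.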

And I want that for every entry in my dict.

This probably doesn't sound too hard, but I am struggling to come up with a clever way to do it.

For now, my idea is to read lines from my file, check whether they begin with ">", and look up whether they exist in my dict. Then I would add up the characters per line until they exceed my start position, and from there somehow manage to get the right part of that line to match my startPos - X.

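currentCluster = None  # no header seen yet
charCount = 0

for line in genomeFile:

    line = line.strip()
    if not line:
        continue  # skip blank lines; line[0] would raise IndexError

    if line[0] == ">":
        header = line
        currentCluster = foundClusters.get(header[1:])
        charCount = 0  # positions restart with every entry

        if currentCluster is not None:
            outputFile.write(header + "\n")

    elif currentCluster is not None:

        # count sequence characters only, not the header line
        charCount += len(line)

        # *crazy calculations to find the actual part i want to extract*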

I am quite the Python beginner, so maybe someone has a better idea of how to solve this?

-- While typing this, I got the idea to call file.read(startPos-X-1) once a line matches a header I am looking for, to skip ahead to my desired position, and then call file.read((endPos+X) - (startPos-X) + 1) to extract the part I am looking for. If this works, it seems pretty easy to accomplish what I want.

I'll post this anyway; maybe someone has an even better way, or maybe my idea won't work.

Thanks for any input.

EDIT:

Turns out you can't mix for line in file with file.read(x), since the former uses buffering, so back to the batcave. Also, file.read(x) probably counts newlines too, which my start and end positions do not.
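The newline concern can be checked with a tiny sketch. An in-memory `io.StringIO` stands in for the genome file here (an assumption made purely for illustration); `read(n)` returns `n` characters including line breaks, so fewer than `n` bases come back whenever the read crosses a line boundary.

```python
import io

# Two sequence lines of four bases each, below one header line.
handle = io.StringIO(">header\nACGT\nTTGA\n")
handle.readline()                 # consume the header line

chunk = handle.read(6)            # ask for six characters
bases = chunk.replace("\n", "")   # drop the line break that was counted

print(repr(chunk))                # 'ACGT\nT'
print(len(bases))                 # 5, not 6
```

One way around this is to join the stripped sequence lines of an entry into a single string first, so that positions can be used directly as string indices.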

(also fixed some stupid errors in my posted code)

Perhaps you could use a function to generate the slice indices you need.

def biggerFrame(start, end, left, frameSize=25):  # frameSize defaults to 25
    # start and end are 1-based and inclusive; the result is a 0-based Python slice
    newStart = max(start - 1 - frameSize, 0)  # never reach before the first character
    newEnd = end + min(frameSize, left)       # never reach past the last character
    return newStart, newEnd

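With that function, you can add something like the following to your code once all sequence lines of an entry have been joined into one string (called sequence here), since the positions refer to the whole entry rather than to a single line.

for indices in currentCluster:
    # frameSize is 50 here; you can make it whatever you want.
    sliceStart, sliceEnd = biggerFrame(indices[0], indices[1], indices[2], 50)
    outputFile.write(sequence[sliceStart:sliceEnd] + '\n')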
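Putting the pieces together, a minimal end-to-end sketch might look like the following. It assumes each entry fits in memory, that positions are 1-based and inclusive as described in the question, and it uses hypothetical names (`read_fasta`, `extract_windows`, `flank`) plus a small in-memory file in place of the real genome file.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs; sequence lines are joined without newlines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []  # drop the leading ">"
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)     # emit the last entry


def extract_windows(handle, clusters, flank=25):
    """Return (header, start, end, substring) for every interval in clusters."""
    results = []
    for header, sequence in read_fasta(handle):
        for start, end, _left in clusters.get(header, []):
            lo = max(start - 1 - flank, 0)            # 1-based -> 0-based, clamped
            hi = min(end + flank, len(sequence))      # exclusive slice end, clamped
            results.append((header, start, end, sequence[lo:hi]))
    return results


# Hypothetical demo data in the same shape as the question's dictionary.
demo_file = io.StringIO(">seq1\nABCDEFGHIJ\nKLMNOPQRST\n>seq2\nACGT\n")
demo_clusters = {"seq1": [[8, 12, 8], [1, 2, 18]]}

rows = extract_windows(demo_file, demo_clusters, flank=3)
for row in rows:
    print(row)
```

Because the whole entry is joined before slicing, line breaks never enter the position arithmetic, and the length of the joined string can replace the stored `left` value when clamping the end.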
