Python: Extract substring from text file based on character index

So I have a file with a few thousand entries of the following form (FASTA format, if anyone wants to know):

>scaffold1110_len145113_cov91
TAGAAAATTGAATAATTGATAGTTCTTAACGAAAAGTAAAAGTTTAAAGTATACAGAAATTTCAGGCTATTCACTCTTTT
ATAATCCAAAATTAGAAATACCACACCTTGCATAAAGTTTAAGATATTTACAAAAACCTGAAGTGGATAATCCGAAATCG
...
>Next_Header
ATGCTA...

And I have a Python dictionary, produced by part of my code, that contains information like the following for a number of headers:

{'scaffold1110_len145113_cov91': [[38039, 38854, 106259], [40035, 40186, 104927]]}

This describes each entry by its header and a list of [start, end, left] triples: the start position, the end position, and the number of characters remaining in that entry after the end position (so start=1 means the first character of the line below the corresponding header).

What I want to do is extract the string for this interval, including 25 characters (or a variable number) in front of and behind it if the entry allows for it, and otherwise including all characters up to the beginning or end. (For example, when the start position is 8, I can't include 25 characters in front, only the ones that exist.)

And I want that for every entry in my dict.

It probably doesn't sound too hard, but I am struggling to come up with a clever way to do it.

For now, my idea was to read lines from my file, check whether they begin with ">", and look up whether they exist in my dict. Then I would add up the characters per line until they exceed my start position, and from there somehow get the right part of that line to match my startPos - X.

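charCount = 0
currentCluster = None

for line in genomeFile:
    line = line.strip()
    if not line:
        continue  # skip blank lines so line[0] cannot fail
    if line.startswith(">"):
        header = line
        # look up the header (without the leading ">") in the dict
        currentCluster = foundClusters.get(header[1:])
        charCount = 0  # start counting again for each entry
        if currentCluster is not None:
            outputFile.write(header + "\n")
    elif currentCluster is not None:
        # only sequence lines count towards the position, not the header
        charCount += len(line)
        # *crazy calculations to find the actual part i want to extract*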

I am quite the Python beginner, so maybe someone has a better idea of how to solve this?

-- While typing this, I got the idea to use file.read(startPos-X-1) after a line matches a header I am looking for, to read characters up to my desired position, and from there use file.read((endPos+X) - (startPos-X) + 1) to extract the part I am looking for. If this works, it seems pretty easy to accomplish what I want.

I'll post this anyway; maybe someone has an even better way, or maybe my idea won't work.

Thanks for any input.

EDIT:

Turns out you can't mix for line in file with file.read(x), since the former uses buffering, so back to the batcave. Also, file.read(x) probably counts newlines too, which my start and end positions do not.

(Also fixed some stupid errors in my posted code.)
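One way to sidestep both problems mentioned in the edit is to avoid read() with character counts altogether: collect the sequence lines of each entry, strip the line breaks, and join them into one string, so that position 1 is simply index 0 of that string. The following is only a minimal sketch under that assumption. The function name read_fasta is made up for illustration, and an in-memory io.StringIO with two invented entries stands in for the real file.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs; the sequence contains no line breaks."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                # a new header closes the previous entry
                yield header, "".join(chunks)
            header = line[1:]  # drop the leading ">"
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        # emit the last entry in the file
        yield header, "".join(chunks)

# Stand-in for the real file: two short, made-up entries.
demo = io.StringIO(">entry1\nTAGA\nAAAT\n>entry2\nATGCTA\n")
records = dict(read_fasta(demo))
print(records)
```

Because each sequence is held as one string without newlines, the positions stored in the dict can be used for slicing directly, at the cost of keeping one entry at a time in memory.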

Perhaps you could use a function to generate the slice indices you need.

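def biggerFrame(start, end, left, frameSize=25):  # frameSize defaults to 25
    # start and end are 1-based and inclusive, so shift start to a 0-based slice index
    newStart = start - 1 - frameSize
    if newStart < 0:
        newStart = 0
    if frameSize > left:
        # fewer than frameSize characters remain after end, so take all of them
        newEnd = end + left
    else:
        newEnd = end + frameSize
    return newStart, newEnd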

With that function, you can add something like the following to your code.

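for indices in currentCluster:
    # frameSize is 50 here; you can make it whatever you want
    newStart, newEnd = biggerFrame(indices[0], indices[1], indices[2], 50)
    # line must hold the entry's whole sequence (all of its lines joined),
    # because the indices count characters across the entire entry
    outputFile.write(line[newStart:newEnd] + '\n')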
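To make the index arithmetic concrete, here is a minimal sketch of the extraction step on a sequence that has already been joined into a single string. It assumes, as described in the question, that start and end are 1-based and inclusive. Unlike the function above, it clips against the length of the string rather than the stored left value. The name extract_with_flank, the toy sequence, and the toy dictionary are invented for illustration.

```python
def extract_with_flank(sequence, start, end, flank=25):
    """Return positions start..end (1-based, inclusive) plus up to `flank`
    characters on each side, clipped at the ends of the sequence."""
    slice_start = max(0, start - 1 - flank)      # 0-based, clipped at the beginning
    slice_end = min(len(sequence), end + flank)  # exclusive, clipped at the end
    return sequence[slice_start:slice_end]

sequence = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # 26 made-up characters
# Same [start, end, left] layout as in the question, with toy values.
found = {"toy_entry": [[3, 5, 21], [10, 12, 14], [24, 25, 1]]}

results = [extract_with_flank(sequence, start, end, flank=4)
           for start, end, _left in found["toy_entry"]]
print(results)  # ['ABCDEFGHI', 'FGHIJKLMNOP', 'TUVWXYZ']
```

The first and last intervals sit too close to the ends for a full flank of 4, so they are clipped, while the middle one gets 4 characters on each side.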
