Python 3.x - How to efficiently split an array of objects into smaller batch files?

I'm fairly new to Python and I'm attempting to split a text file, in which each entry consists of two lines, into batches of at most 400 entries.

The data I'm working with are thousands of sequences in FASTA format (plain text with a header line, used in bioinformatics), where entries look like this:

>HORVU6Hr1G000325.5
PIPPPASHFHPHHQNPSAATQPLCAAMAPAAKKPPLKSSSSHNSAAGDAA
>HORVU6Hr1G000326.1
MVKFTAEELRGIMDKKNNIRNMSVIAHVD
...

In Biopython, there is a parser, SeqIO.parse, which gives access to these entries as a sequence of objects consisting of IDs and strings. I need those objects in later parts of my code, and since I need to be memory efficient, I'd like to avoid reading/parsing the source file more times than necessary.

The Biopython manual recommends doing this via a generator, which is what I'm using: https://biopython.org/wiki/Split_large_file

However, I'm using Python 3.7, whilst the code there is written for Python 2.x, so some changes are definitely necessary. I've changed

entry = iterator.next()

into

entry = next(iterator)

but I'm not sure if that's all I need to change.

Here's the code:

def batch_iterator(iterator, batch_size=400):
    """Returns lists of length batch_size."""
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)
            except StopIteration:
                entry = None

            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch

while True:
    bsequence = input("Please enter the full path to your FASTA file(e.g. c:\\folder1\\folder2\\protein.fasta):\n")
    try:
        fastafile = open(bsequence)
        break
    except:
        print("File not found!\n")            


record_iter = SeqIO.parse(fastafile,"fasta")
num = 0
for line in fastafile:
    if line.startswith(">"):
        num += 1

print("num=%i" % (num,))
if num > 400:
    print("The specified file contains %i sequences. It's recommended to split the FASTA file into batches of max. 400 sequences.\n" % (num,))
    while True:
        decision = input("Do you wish to create batch files? (Original file will not be overwritten)\n(Y/N):")
        if (decision == 'Y' or 'y'):
            for i, batch in enumerate(batch_iterator(record_iter, 400), 1):
                filename = "group_%i.fasta" % (i + 1)
                with open(filename, "w") as handle:
                    count = SeqIO.write(batch, handle, "fasta")
                print("Wrote %i records to %s" % (count, filename))
            break
        elif (decision == 'N' or 'n'):
            break
        else:
            print('Invalid input\n')

...next part of the code

When I run this, after the Y/N prompt, even if I type Y, the program just skips over to the next part of my code without creating any new file. The debugger shows the following:

Do you wish to create batch files? (Original file will not be overwritten)
(Y/N):Y
Traceback (most recent call last):
  File "\Biopython\mainscript.py", line 32, in batch_iterator
    entry = next(iterator)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1569, in _trace
    return self._trace_and_catch(frame, event, arg)

  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1611, in _trace_and_catch
    frame.f_back, event, marker_function_args, node

  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1656, in _handle_progress_event
    self._save_current_state(frame, event, args, node)

  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1738, in _save_current_state
    exception_info = self._export_exception_info()

  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1371, in _export_exception_info
    "affected_frame_ids": exc[1]._affected_frame_ids_,

AttributeError: 'StopIteration' object has no attribute '_affected_frame_ids_'

Is there some difference between Python 2.x and 3.x that I'm overlooking? Is the problem somewhere else? Is this approach completely wrong? Thanks in advance!

I can't check your whole code since you've omitted part of it, but I can see two things wrong here:

num = 0
for line in fastafile:
    if line.startswith(">"):
        num += 1

These lines exhaust your file object fastafile. Remove them entirely (and remember to fix the indentation below, remove the if num > 400: check, etc.).

if (decision == 'Y' or 'y'):

This does not do what you think it does. Change it to if decision in ('Y', 'y'): or if decision.lower() == 'y':. You repeat this pattern below in the line if (decision == 'N' or 'n'):, so change that, too.

Make the changes and try to run the code again.

Explanation
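For reference, here is a minimal Python 3 sketch of the batching step on its own. It is not the Biopython wiki's code: it swaps the nested while loops for itertools.islice, and the records generator is a made-up placeholder standing in for record_iter so that the example runs without a FASTA file. Counting while batching means the file only needs one pass.

```python
from itertools import islice


def batch_iterator(iterable, batch_size=400):
    """Yield lists of up to batch_size items from any iterable."""
    iterator = iter(iterable)
    while True:
        # islice pulls at most batch_size items and stops quietly at the end
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Placeholder data (an assumption): in the real script this would be record_iter
records = (f"record_{n}" for n in range(1, 1001))

total = 0
for i, batch in enumerate(batch_iterator(records, 400), 1):
    total += len(batch)  # count entries as they are batched, no second pass
    print(f"group_{i}: {len(batch)} records")
print("total:", total)
```

Note that because enumerate(..., 1) already starts at 1, building the file name from i rather than i + 1 would make the numbering start at group_1.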

1st issue: in Python, a file object (i.e. what open('filename.txt', 'r') returns) is an iterator that behaves much like a generator, which means that a single pass over it consumes it. This may seem a bit weird at first, but that's the whole point of iterating lazily: the file can be looped over line by line without ever having to load the whole content at once, because the file object just keeps track of which line comes next.

The flip side is that it doesn't go backwards on its own, so when you run your for line in fastafile block, you exhaust the file object. record_iter reads from that same file handle, so when you later call batch_iterator(record_iter, 400), there is nothing left for it to read. That is why no batch files get written - batch_iterator cannot parse the FASTA sequences if there's nothing left to parse.
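The exhaustion described above can be reproduced without Biopython. The sketch below uses io.StringIO as a stand-in for the opened file (an assumption made only so the example is self-contained; it iterates line by line the same way). It also shows handle.seek(0), which rewinds the handle if a second pass really is needed.

```python
import io

# Stand-in for an opened FASTA file (made-up sequences)
fasta_text = ">seq1\nPIPPPASH\n>seq2\nMVKFTAEEL\n"
handle = io.StringIO(fasta_text)

# First pass: counts the header lines and leaves the handle at end of file
first_pass = sum(1 for line in handle if line.startswith(">"))

# Second pass over the same handle: nothing left to read
second_pass = sum(1 for line in handle if line.startswith(">"))

# Rewinding makes the content readable again
handle.seek(0)
third_pass = sum(1 for line in handle if line.startswith(">"))

print(first_pass, second_pass, third_pass)  # prints: 2 0 2
```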

2nd issue: in a conditional with a boolean operator such as if (decision == 'Y' or 'y'):, Python treats each side of or as a separate expression; the comparison does not distribute over both strings. So Python actually sees if (bool(decision == 'Y') or bool('y')):.

Since bool('y') evaluates to True (just like any non-empty string), your expression becomes if (bool(decision == 'Y') or True):, which is always true.

Use one of the methods I suggested in order to compare a variable to more than one value in a conditional.
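A quick way to see the difference is to put both forms side by side. The function names below are made up for illustration.

```python
def buggy_is_yes(decision):
    # Parsed as (decision == 'Y') or 'y'; the non-empty string 'y' is truthy
    return bool(decision == 'Y' or 'y')


def is_yes(decision):
    # Compares the normalised input against a single value
    return decision.lower() == 'y'


for answer in ("Y", "y", "N", "banana"):
    print(answer, buggy_is_yes(answer), is_yes(answer))
```

The buggy version reports True for every input, including N, which is why the elif and else branches in the question can never run.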
