Python 3.x - How to efficiently split an array of objects into smaller batch files?

Question

I'm fairly new to Python, and I'm trying to split a text file, in which each entry consists of two lines, into batches of at most 400 objects.

The data I'm working with is thousands of sequences in FASTA format (plain text with headers, used in bioinformatics), where entries look like this:

> HORVU6Hr1G000325.5

PIPPPASHFHPHHQNPSAATQPLCAAMAPAAKKPPLKSSSSHNSAAGDAA

> HORVU6Hr1G000326.1

MVKFTAEELRGIMDKKNNIRNMSVIAHVD

...

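Biopython has a parser, SeqIO.parse, that lets me access these as an array of objects consisting of an ID and a string, which I need in later parts of my code. Because I need to be memory-efficient, I would like to avoid reading or parsing the source file more often than necessary.

The Biopython manual recommends doing this with a generator, which is what I'm using: https://biopython.org/wiki/Split_large_file

However, I'm using Python 3.7 and the code there is written for Python 2.x, so some changes are certainly necessary. I changed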

entry = iterator.next()

to

entry = next(iterator)

but I'm not sure whether that is all that needs changing.

Here is the code:

def batch_iterator(iterator, batch_size=400):
    """Returns lists of length batch_size."""
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)
            except StopIteration:
                entry = None

            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch

while True:
    bsequence = input("Please enter the full path to your FASTA file(e.g. c:\\folder1\\folder2\\protein.fasta):\n")
    try:
        fastafile = open(bsequence)
        break
    except:
        print("File not found!\n")            


record_iter = SeqIO.parse(fastafile,"fasta")
num = 0
for line in fastafile:
    if line.startswith(">"):
        num += 1

print("num=%i" % (num,))
if num > 400:
    print("The specified file contains %i sequences. It's recommended to split the FASTA file into batches of max. 400 sequences.\n" % (num,))
    while True:
        decision = input("Do you wish to create batch files? (Original file will not be overwritten)\n(Y/N):")
        if (decision == 'Y' or 'y'):
            for i, batch in enumerate(batch_iterator(record_iter, 400), 1):
                filename = "group_%i.fasta" % (i + 1)
                with open(filename, "w") as handle:
                    count = SeqIO.write(batch, handle, "fasta")
                print("Wrote %i records to %s" % (count, filename))
            break
        elif (decision == 'N' or 'n'):
            break
        else:
            print('Invalid input\n')

...next part of the code

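When I run it, after the Y/N prompt the program just skips to the next part of the code without creating any new files, even if I enter Y. The debugger shows the following: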

Do you wish to create batch files? (Original file will not be overwritten)
(Y/N):Y
Traceback (most recent call last):
  File "\Biopython\mainscript.py", line 32, in batch_iterator
    entry = next(iterator)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1569, in _trace
    return self._trace_and_catch(frame, event, arg)

  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1611, in _trace_and_catch
    frame.f_back, event, marker_function_args, node

  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1656, in _handle_progress_event
    self._save_current_state(frame, event, args, node)

  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1738, in _save_current_state
    exception_info = self._export_exception_info()

  File "C:\Program Files (x86)\Thonny\lib\site-packages\thonny\backend.py", line 1371, in _export_exception_info
    "affected_frame_ids": exc[1]._affected_frame_ids_,

AttributeError: 'StopIteration' object has no attribute '_affected_frame_ids_'

Am I overlooking some difference between Python 2.x and 3.x? Is the problem somewhere else? Is this approach completely wrong? Thanks in advance!

Answer 1

I can't check your entire code because you have left part of it out, but I can see two things wrong here:

num = 0
for line in fastafile:
    if line.startswith(">"):
        num += 1

These lines exhaust the file object fastafile. Remove them entirely (and remember to fix the indentation below, remove the if num > 400: check, and so on).

if (decision == 'Y' or 'y'):

This doesn't do what you think it does. Change it to if decision in ('Y', 'y'): or to if decision.lower() == 'y':. You repeat this pattern further down in the line if (decision == 'N' or 'n'):, so change that one as well.

Make these changes and try running the code again.
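A minimal sketch of how the batching step might look once the counting loop is gone, assuming the input is any iterator. It uses itertools.islice in place of the manual try/except loop, which is a swapped-in technique and not the code from the Biopython wiki; the name batched_records and the stand-in data are hypothetical.

```python
from itertools import islice


def batched_records(iterator, batch_size=400):
    """Yield lists of up to batch_size items from any iterable (hypothetical helper)."""
    iterator = iter(iterator)  # accept lists as well as one-shot iterators
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # source exhausted
        yield batch


# Stand-in data: 1000 fake record IDs instead of parsed FASTA records.
records = ("seq_%i" % n for n in range(1000))
batch_sizes = [len(batch) for batch in batched_records(records, 400)]
print(batch_sizes)  # [400, 400, 200]
```

With Biopython, the same helper could be fed SeqIO.parse(fastafile, "fasta"), and each batch passed to SeqIO.write(batch, handle, "fasta"), which returns the number of records written. Summing those counts gives the total without a separate pass over the file.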

Explanation

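First issue: In Python, a file object (i.e. what open('filename.txt', 'r') returns) is an iterator that behaves like a generator, which means it can only be iterated over once. That may seem a little odd at first, but it is the whole point of this design. Because the file object is lazy, the file can be looped over line by line without loading its entire contents at once; the object only keeps track of the next line.

The flip side is that it cannot go backwards, so when you write your for line in fastafile block, you exhaust it. When you later call batch_iterator(record_iter, 400), the file behind record_iter has already been used up, which is why you run into the error later: batch_iterator cannot parse FASTA sequences if there is nothing left to parse.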
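The exhaustion can be reproduced without Biopython. The sketch below is an illustration, not the asker's script: it assumes io.StringIO is an acceptable stand-in for a real file and shows that a second pass yields nothing until the handle is rewound with seek(0).

```python
import io

# io.StringIO stands in for the object returned by open(); both are one-shot line iterators.
fasta_text = ">seq1\nPIPPPASH\n>seq2\nMVKFTAEEL\n"
handle = io.StringIO(fasta_text)

# First pass: count header lines, which consumes the whole handle.
num = sum(1 for line in handle if line.startswith(">"))

# A second pass over the same handle finds nothing left to read.
leftover = list(handle)

# Rewinding to the start makes the content available again.
handle.seek(0)
after_rewind = list(handle)

print(num, leftover, len(after_rewind))  # 2 [] 4
```

Rewinding is a general option for plain file handles. Whether it suits a parser that was already created from the handle is not something this sketch checks, so removing the counting loop remains the simpler fix.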

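Second issue: In a condition with a Boolean operator, such as if (decision == 'Y' or 'y'):, Python evaluates each operand as its own expression. So what Python actually sees is if (bool(decision == 'Y') or bool('y')):

Since bool('y') evaluates to True (as does any non-empty string), your expression becomes if (bool(decision == 'Y') or True):, which is obviously always true.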
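The difference is easy to see side by side. In this small sketch the function names are hypothetical; the first function reproduces the condition from the question, and the other two are the two fixes suggested above.

```python
def buggy_is_yes(decision):
    # Parsed as (decision == 'Y') or ('y'); the non-empty string 'y' is always truthy.
    return bool(decision == 'Y' or 'y')


def is_yes(decision):
    # Fix 1: compare the normalized input against a single value.
    return decision.lower() == 'y'


def is_yes_membership(decision):
    # Fix 2: test membership in a tuple of accepted values.
    return decision in ('Y', 'y')


print(buggy_is_yes('N'), is_yes('N'), is_yes_membership('N'))  # True False False
print(buggy_is_yes('y'), is_yes('y'), is_yes_membership('y'))  # True True True
```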

Use one of the methods I suggested whenever you need to compare a variable against multiple values in a condition.
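On the Python 2 versus Python 3 part of the question: swapping iterator.next() for next(iterator) is the expected change, since Python 3 renamed the method to __next__ and the built-in next() is the portable way to call it. The sketch below uses a plain list iterator rather than the FASTA parser, and also shows the optional default argument, which could replace the try/except StopIteration block in batch_iterator.

```python
letters = iter(["a", "b"])

# Python 3 spelling: the built-in next() calls the iterator's __next__ method.
first = next(letters)

# A second argument is returned instead of raising StopIteration at the end.
second = next(letters, None)
third = next(letters, None)

print(first, second, third)  # a b None

# The Python 2 method name no longer exists on Python 3 iterators.
has_py2_method = hasattr(letters, "next")
print(has_py2_method)  # False
```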


Answer 1: score 2, accepted, 2019-04-24 22:24:10
