
Efficient way to read/write/parse large text files (python)

Say I have an absurdly large text file. I would not think my file would grow larger than ~500 MB, but for the sake of scalability and my own curiosity, let's say it is on the order of a few gigabytes.

My end goal is to map it to an array of sentences (separated by '?', '!', '.', and, for all intents and purposes, ';') and each sentence to an array of words. I was then going to use numpy for some statistical analysis.

What would be the most scalable way to go about doing this?

PS: I thought of rewriting the file to have one sentence per line, but I ran into problems trying to load the file into memory. I know of the solution where you read off chunks of data from one file, manipulate them, and write them to another, but that seems inefficient in terms of disk space. I know most people would not worry about using 10 GB of scratch space nowadays, but it does seem like there ought to be a way of editing chunks of the file in place.
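For concreteness, the chunked read-process-write approach mentioned above might look something like the rough sketch below; the file names, the 64 KB block size, and the terminator regex are illustrative assumptions, not anything fixed by the question.

import re

terminator = re.compile(r'(?<=[.!?;])\s*')

with open('input.txt') as src, open('one_per_line.txt', 'w') as dst:
    leftover = ''
    while True:
        block = src.read(65536)             # read the source in 64 KB chunks
        if not block:
            break
        pieces = terminator.split(leftover + block)
        leftover = pieces.pop()             # the last piece may be a partial sentence
        for sentence in pieces:
            if sentence:
                dst.write(sentence + '\n')  # one sentence per line in the scratch file
    if leftover:
        dst.write(leftover + '\n')          # flush any trailing partial sentence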

My first thought would be to use a stream parser: basically you read in the file a piece at a time and do the statistical analysis as you go. This is typically done with markup languages like HTML and XML, so you'll find a lot of parsers for those languages out there, including in the Python standard library. A simple sentence parser is something you can write yourself, though; for example:

import re, collections

# Split after a sentence terminator ('.', '!', '?', ';'), also consuming
# any whitespace that follows it.
sentence_terminator = re.compile(r'(?<=[.!?;])\s*')

class SentenceParser(object):
    def __init__(self, filelike):
        self.f = filelike
        # The last element of the deque is always the current, possibly
        # incomplete, sentence; everything before it is a complete sentence.
        self.buffer = collections.deque([''])
    def __next__(self):
        # Keep reading until the buffer holds at least one complete sentence.
        while len(self.buffer) < 2:
            data = self.f.read(512)
            if not data:
                # End of file: trailing text without a terminator is dropped.
                raise StopIteration()
            self.buffer += sentence_terminator.split(self.buffer.pop() + data)
        return self.buffer.popleft()
    next = __next__  # Python 2 compatibility
    def __iter__(self):
        return self

This will only read data from the file as needed to complete a sentence. It reads in 512-byte blocks so you'll be holding less than a kilobyte of file contents in memory at any one time, no matter how large the actual file is.
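As a minimal usage sketch (the file name 'corpus.txt' and the whitespace-based word split are assumptions for illustration, not part of the answer), you can drive the parser like this and feed the per-sentence word lists into your numpy statistics:

with open('corpus.txt') as f:
    for sentence in SentenceParser(f):
        words = sentence.split()   # crude whitespace word split
        # ... accumulate whatever per-sentence statistics you need here ...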

After a stream parser, my second thought would be to memory-map the file. That way you could go through and replace the space that (presumably) follows each sentence terminator with a newline; after that, each sentence would start on a new line, and you'd be able to open the file and use readline() or a for loop to go through it line by line. But you'd still have to worry about multi-line sentences; plus, if any sentence terminator is not followed by a whitespace character, you would have to insert a newline (instead of replacing something else with it) and that could be horribly inefficient for a large file.
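A rough sketch of that memory-mapping idea, under the assumption that every terminator is followed by exactly one space or tab (so bytes can be overwritten in place without changing the file's size); the file name is a placeholder:

import mmap, re

with open('corpus.txt', 'r+b') as f:                     # open read/write in binary mode
    mm = mmap.mmap(f.fileno(), 0)                        # map the whole file into memory
    for match in re.finditer(rb'(?<=[.!?;])[ \t]', mm):  # whitespace byte after each terminator
        mm[match.start():match.end()] = b'\n'            # overwrite it with a newline, in place
    mm.flush()
    mm.close()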
