
Efficient way to read/write/parse large text files (python)

Say I have an absurdly large text file. I would not think my file would grow larger than ~500 MB, but for the sake of scalability and my own curiosity, let's say it is on the order of a few gigabytes.

My end goal is to map it to an array of sentences (separated by '?', '!', '.', and, for all intents and purposes, ';') and each sentence to an array of words. I was then going to use numpy for some statistical analysis.

What would be the most scalable way to go about doing this?

PS: I thought of rewriting the file to have one sentence per line, but I ran into problems trying to load the file into memory. I know of the solution where you read off chunks of data from one file, manipulate them, and write them to another, but that seems inefficient in terms of disk space. I know most people would not worry about using 10 GB of scratch space nowadays, but it does seem like there ought to be a way of editing chunks of the file in place.
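For concreteness, the chunked read-process-write approach mentioned above might look something like the rough sketch below; the file names, the 64 KB block size, and the terminator regex are illustrative assumptions, not anything fixed by the question.

import re

terminator = re.compile(r'(?<=[.!?;])\s*')

with open('input.txt') as src, open('one_per_line.txt', 'w') as dst:
    leftover = ''
    while True:
        block = src.read(65536)             # read the source in 64 KB chunks
        if not block:
            break
        pieces = terminator.split(leftover + block)
        leftover = pieces.pop()             # the last piece may be a partial sentence
        for sentence in pieces:
            if sentence:
                dst.write(sentence + '\n')  # one sentence per line in the scratch file
    if leftover:
        dst.write(leftover + '\n')          # flush any trailing partial sentence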

My first thought would be to use a stream parser: basically you read in the file a piece at a time and do the statistical analysis as you go. This is typically done with markup languages like HTML and XML, so you'll find a lot of parsers for those languages out there, including in the Python standard library. A simple sentence parser is something you can write yourself, though; for example:

import re, collections

# Split after a sentence terminator ('.', '!', '?', ';'), also consuming
# any whitespace that follows it.
sentence_terminator = re.compile(r'(?<=[.!?;])\s*')

class SentenceParser(object):
    def __init__(self, filelike):
        self.f = filelike
        # The last element of the deque is always the current, possibly
        # incomplete, sentence; everything before it is a complete sentence.
        self.buffer = collections.deque([''])
    def __next__(self):
        # Keep reading until the buffer holds at least one complete sentence.
        while len(self.buffer) < 2:
            data = self.f.read(512)
            if not data:
                # End of file: trailing text without a terminator is dropped.
                raise StopIteration()
            self.buffer += sentence_terminator.split(self.buffer.pop() + data)
        return self.buffer.popleft()
    next = __next__  # Python 2 compatibility
    def __iter__(self):
        return self

This will only read data from the file as needed to complete a sentence. It reads in 512-byte blocks so you'll be holding less than a kilobyte of file contents in memory at any one time, no matter how large the actual file is.
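As a minimal usage sketch (the file name 'corpus.txt' and the whitespace-based word split are assumptions for illustration, not part of the answer), you can drive the parser like this and feed the per-sentence word lists into your numpy statistics:

with open('corpus.txt') as f:
    for sentence in SentenceParser(f):
        words = sentence.split()   # crude whitespace word split
        # ... accumulate whatever per-sentence statistics you need here ...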

After a stream parser, my second thought would be to memory-map the file. That way you could go through and replace the space that (presumably) follows each sentence terminator with a newline; after that, each sentence would start on a new line, and you'd be able to open the file and use readline() or a for loop to go through it line by line. But you'd still have to worry about multi-line sentences; plus, if any sentence terminator is not followed by a whitespace character, you would have to insert a newline (instead of replacing something else with it) and that could be horribly inefficient for a large file.
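A rough sketch of that memory-mapping idea, under the assumption that every terminator is followed by exactly one space or tab (so bytes can be overwritten in place without changing the file's size); the file name is a placeholder:

import mmap, re

with open('corpus.txt', 'r+b') as f:                     # open read/write in binary mode
    mm = mmap.mmap(f.fileno(), 0)                        # map the whole file into memory
    for match in re.finditer(rb'(?<=[.!?;])[ \t]', mm):  # whitespace byte after each terminator
        mm[match.start():match.end()] = b'\n'            # overwrite it with a newline, in place
    mm.flush()
    mm.close()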
