
Efficient way to parse through huge file

I have to parse through a really big file, modify its contents, and write that to another file. The file that I have right now is not that big in comparison to what it could be, but it's big nonetheless.

The file is 1.3 GB and contains about 7 million lines of this format:

8823192\t/home/pcastr/...

Where \t is a tab character. The number at the beginning is the apparent size of the path that follows.

I want an output file with lines looking like this (in csv format):

True,8823192,/home/pcastr/...

Where the first value is whether the path is a directory.

Currently, my code looks something like this:

import os

with open(filepath, "r") as open_file:
    while True:
        line = open_file.readline()
        if line == "":  # Checks for the end of the file
            break
        size = line.split("\t")[0]
        path = line.strip().split("\t")[1]
        is_dir = os.path.isdir(path)

        streamed_file.write(unicode("{isdir},{size},{path}\n".format(isdir=is_dir, size=size, path=path)))

A caveat with this is that files like this WILL get tremendously big, so I not only need a fast solution, but a memory-efficient solution as well. I know that there is usually a trade-off between these two qualities.

The biggest gain is likely to come from calling split only once per line:

size, path = line.strip().split("\t")
# or ...split("\t", 3)[0:2] if there are extra fields to ignore

You can at least simplify your code by treating the input file as an iterator and using the csv module. This might give you a speed-up as well, as it eliminates the need for an explicit call to split:

import csv
import os

with open(filepath, "r") as open_file:
    reader = csv.reader(open_file, delimiter="\t")
    writer = csv.writer(streamed_file)
    for size, path in reader:
        is_dir = os.path.isdir(path)
        writer.writerow([is_dir, size, path])

You might need mmap. Introduction and tutorial here.

As a simplification, it means you can treat files on disk as if they were in RAM, without actually reading the whole file into RAM.
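As a rough illustration, here is a minimal sketch (assuming Python 3, and reusing filepath from the question) of walking the same tab-separated input through mmap; the OS pages the data in on demand rather than loading the whole file, and the writing step is left out:

import mmap
import os

with open(filepath, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # readline() returns b"" once the end of the mapping is reached
        for line in iter(mm.readline, b""):
            size, _, path = line.rstrip(b"\r\n").partition(b"\t")
            is_dir = os.path.isdir(path)  # os.path accepts bytes paths
            # write is_dir, size, path to the output file as before
    finally:
        mm.close()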

Compressing the file before copying it over the network could speed up the processing of the data, because you will get your data to your script faster.

Can you keep the input text file compressed on the remote target system? If yes, you could compress it to a format using an algorithm that is supported in Python (the zlib, gzip, bz2, lzma, or zipfile modules).

If not, you could at least run a script on the remote storage system to compress the file. Next you would read the file, decompress it in memory using one of those Python modules, and then process each line.
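As a hedged sketch along those lines (Python 3; the file names input.tsv.gz and output.csv are placeholders), a gzip-compressed input can be decompressed on the fly while being streamed line by line, so the uncompressed data never has to sit on disk or fully in memory:

import csv
import gzip
import os

# gzip.open in text mode decompresses transparently as lines are read
with gzip.open("input.tsv.gz", "rt") as open_file, \
        open("output.csv", "w", newline="") as out_file:
    reader = csv.reader(open_file, delimiter="\t")
    writer = csv.writer(out_file)
    for size, path in reader:
        writer.writerow([os.path.isdir(path), size, path])

The same pattern works with bz2.open or lzma.open if a different compression format is used.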
