
Fastest way to process a large file?

I have multiple 3 GB tab delimited files. There are 20 million rows in each file. All the rows have to be independently processed, no relation between any two rows. My question is, what will be faster: A. Reading line-by-line using:

with open() as infile:
    for line in infile:

Or B. Reading the file into memory in chunks and processing it, say 250 MB at a time?

The processing is not very complicated, I am just grabbing the value in column1 into List1, column2 into List2, etc. I might need to add some column values together.
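
Roughly, the per-line work looks like this (a simplified sketch; the actual column indices and which columns get summed are hypothetical):

list1, list2, totals = [], [], []

def process(line):
    # split the tab-delimited row into its columns
    fields = line.rstrip("\n").split("\t")
    list1.append(fields[0])   # column1 -> List1
    list2.append(fields[1])   # column2 -> List2
    # example of adding some column values together
    totals.append(float(fields[2]) + float(fields[3]))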

I am using Python 2.7 on a Linux box that has 30GB of memory. The files are ASCII text.

Is there any way to speed things up in parallel? Right now I am using the former method and the process is very slow. Is using any CSVReader module going to help? I don't have to do it in Python; any other language or database ideas are welcome.

It sounds like your code is I/O bound. This means that multiprocessing isn't going to help: if you spend 90% of your time reading from disk, having an extra 7 processes waiting on the next read isn't going to help anything.

And, while using a CSV reading module (whether the stdlib's csv or something like NumPy or Pandas) may be a good idea for simplicity, it's unlikely to make much difference in performance.
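
For reference, switching to the stdlib csv reader for a tab-delimited file is only a small change (a sketch; process_row is a placeholder):

import csv

with open(path, "rb") as infile:   # binary mode for the csv module on Python 2.7
    for row in csv.reader(infile, delimiter="\t"):
        process_row(row)           # row is already a list of column strings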

Still, it's worth checking that you really are I/O bound, instead of just guessing. Run your program and see whether your CPU usage is close to 0% or close to 100% of a core. Do what Amadan suggested in a comment and run your program with just pass for the processing, and see whether that cuts off 5% of the time or 70%. You may even want to try comparing with a loop over os.open and os.read(1024*1024) or something, and see if that's any faster.
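
A quick way to get that raw-read baseline (a sketch; path is your input file and the buffer size is just a starting point):

import os
import time

def time_raw_read(path, bufsize=1024*1024):
    # read the whole file with os.read and report throughput, with no per-line work at all
    fd = os.open(path, os.O_RDONLY)
    total = 0
    t0 = time.time()
    while True:
        buf = os.read(fd, bufsize)
        if not buf:
            break
        total += len(buf)
    os.close(fd)
    elapsed = time.time() - t0
    print("read %d bytes in %.1f s (%.1f MB/s)" % (total, elapsed, total / elapsed / 1e6))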


Since you're using Python 2.x, Python is relying on the C stdio library to guess how much to buffer at a time, so it might be worth forcing it to buffer more. The simplest way to do that is to use readlines(bufsize) for some large bufsize. (You can try different numbers and measure them to see where the peak is. In my experience, usually anything from 64K-8MB is about the same, but depending on your system that may be different, especially if you're, e.g., reading off a network filesystem with great throughput but horrible latency, which swamps the throughput-vs.-latency of the actual physical drive and the caching the OS does.)

So, for example:

bufsize = 65536
with open(path) as infile: 
    while True:
        lines = infile.readlines(bufsize)
        if not lines:
            break
        for line in lines:
            process(line)

Meanwhile, assuming you're on a 64-bit system, you may want to try using mmap instead of reading the file in the first place. This certainly isn't guaranteed to be better, but it may be better, depending on your system. For example:

import mmap

with open(path, "rb") as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)

A Python mmap is sort of a weird object: it acts like a str and like a file at the same time, so you can, e.g., manually iterate scanning for newlines, or you can call readline on it as if it were a file. Both of those will take more processing from Python than iterating the file as lines or doing batch readlines (because a loop that would be in C is now in pure Python… although maybe you can get around that with re, or with a simple Cython extension?)… but the I/O advantage of the OS knowing what you're doing with the mapping may swamp the CPU disadvantage.
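
For example, one way to keep the newline scan in C is to let re find the lines in the mapping (an unbenchmarked sketch; process is a placeholder):

import mmap
import re

with open(path, "rb") as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
    # each match is one line (including its trailing newline, if any);
    # re does the newline scanning in C rather than in a pure-Python loop
    for match in re.finditer(br"[^\n]*\n|[^\n]+$", m):
        process(match.group(0))
    m.close()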

Unfortunately, Python doesn't expose the madvise call that you'd use to tweak things in an attempt to optimize this in C (e.g., explicitly setting MADV_SEQUENTIAL instead of making the kernel guess, or forcing transparent huge pages), but you can actually ctypes the function out of libc.
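
As a rough illustration of the ctypes route, here is a sketch that asks the kernel for sequential readahead via posix_fadvise on the file descriptor instead (madvise itself needs the mapping's address, which is awkward to get from Python 2.7). It assumes 64-bit Linux with glibc, where off_t is a signed 64-bit long and POSIX_FADV_SEQUENTIAL is 2; check both on your platform:

import ctypes
import ctypes.util
import mmap

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
POSIX_FADV_SEQUENTIAL = 2  # value from <fcntl.h> on Linux; verify for your platform

# argtypes assume off_t is a 64-bit signed long (true on 64-bit Linux)
libc.posix_fadvise.argtypes = [ctypes.c_int, ctypes.c_long, ctypes.c_long, ctypes.c_int]

with open(path, "rb") as infile:
    # advise the kernel we'll read this fd sequentially; offset=0, len=0 means the whole file
    err = libc.posix_fadvise(infile.fileno(), 0, 0, POSIX_FADV_SEQUENTIAL)
    if err != 0:
        raise OSError(err, "posix_fadvise failed")
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)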

I know this question is old, but I wanted to do a similar thing, so I created a simple framework which helps you read and process a large file in parallel. I'm leaving what I tried as an answer.

This is the code; I give an example at the end.

from __future__ import print_function

import gc
import multiprocessing as mp
import os
import time

import psutil


def chunkify_file(fname, size=1024*1024*1000, skiplines=-1):
    """
    function to divide a large text file into chunks each having size ~= size so that the chunks are line aligned

    Params :
        fname : path to the file to be chunked
        size : each chunk is approximately this many bytes
        skiplines : number of lines at the beginning to skip, -1 means don't skip any lines
    Returns :
        list of (start position in bytes, length in bytes, fname) tuples, one per chunk
    """
    chunks = []
    fileEnd = os.path.getsize(fname)
    with open(fname, "rb") as f:
        if skiplines > 0:
            for i in range(skiplines):
                f.readline()

        chunkEnd = f.tell()
        count = 0
        while True:
            chunkStart = chunkEnd
            f.seek(f.tell() + size, os.SEEK_SET)
            f.readline()  # make this chunk line aligned
            chunkEnd = f.tell()
            chunks.append((chunkStart, chunkEnd - chunkStart, fname))
            count+=1

            if chunkEnd >= fileEnd:
                break
    return chunks

def parallel_apply_line_by_line_chunk(chunk_data):
    """
    function to apply a function to each line in a chunk

    Params :
        chunk_data : tuple of (chunk_start, chunk_size, file_path, func_apply, *func_args)
    Returns :
        list of the non-None results for this chunk
    """
    chunk_start, chunk_size, file_path, func_apply = chunk_data[:4]
    func_args = chunk_data[4:]

    t1 = time.time()
    chunk_res = []
    with open(file_path, "rb") as f:
        f.seek(chunk_start)
        cont = f.read(chunk_size).decode(encoding='utf-8')
        lines = cont.splitlines()

        for i,line in enumerate(lines):
            ret = func_apply(line, *func_args)
            if ret is not None:
                chunk_res.append(ret)
    return chunk_res

def parallel_apply_line_by_line(input_file_path, chunk_size_factor, num_procs, skiplines, func_apply, func_args, fout=None):
    """
    function to apply a supplied function line by line in parallel

    Params :
        input_file_path : path to input file
        chunk_size_factor : size of 1 chunk in MB
        num_procs : number of parallel processes to spawn, max used is num of available cores - 1
        skiplines : number of top lines to skip while processing
        func_apply : a function which expects a line and outputs None for lines we don't want processed
        func_args : arguments to function func_apply
        fout : optional open file object; if given, processed lines are written there instead of collected
    Returns :
        list of the non-None results obtained by processing each line (empty if fout is given)
    """
    num_parallel = min(num_procs, psutil.cpu_count() - 1)  # use at most (available cores - 1) worker processes

    jobs = chunkify_file(input_file_path, 1024 * 1024 * chunk_size_factor, skiplines)

    jobs = [list(x) + [func_apply] + func_args for x in jobs]

    print("Starting the parallel pool for {} jobs ".format(len(jobs)))

    lines_counter = 0

    pool = mp.Pool(num_parallel, maxtasksperchild=1000)  # without maxtasksperchild, memory keeps growing as worker processes linger

    outputs = []
    for i in range(0, len(jobs), num_parallel):
        print("Chunk start = ", i)
        t1 = time.time()
        chunk_outputs = pool.map(parallel_apply_line_by_line_chunk, jobs[i : i + num_parallel])

        for subl in chunk_outputs:
            for x in subl:
                if fout is not None:
                    print(x, file=fout)
                else:
                    outputs.append(x)
                lines_counter += 1
        del chunk_outputs
        gc.collect()
        print("All Done in time ", time.time() - t1)

    print("Total lines we have = {}".format(lines_counter))

    pool.close()
    pool.terminate()
    return outputs

Say, for example, I have a file in which I want to count the number of words in each line; then the processing of each line would look like:

def count_words_line(line):
    return len(line.strip().split())

and then call the function like:

parallel_apply_line_by_line(input_file_path, 100, 8, 0, count_words_line, [], fout=None)

Using this, I get a speed-up of ~8 times compared to vanilla line-by-line reading on a sample file of size ~20GB, in which I do some moderately complicated processing on each line.
