
Generate zip stream without using temp files

I've got a Python method that needs to collect lots of data from an API, format it as CSV, compress it, and stream the result back.

I've been Googling and every solution I can find either requires writing to a temp file or keeping the whole archive in memory.

Memory is definitely not an option, as I'd hit OOM pretty rapidly. Writing to a temp file has a slew of problems (this box currently only uses disk for logs, there's a much longer lead time before the download starts, file cleanup issues, etc.). Not to mention that it's just nasty.

I'm looking for a library that will allow me to do something like...

C = Compressor(outputstream)
C.BeginFile('Data.csv')
for D in Api.StreamResults():
    C.Write(D)
C.CloseFile()
C.Close()

In other words, something that writes to the output stream as I write data in.

I've managed to do this in .NET and PHP, but I have no idea how to approach it in Python.

To put things into perspective, by "lots" of data I mean I need to be able to handle up to ~10 GB of (raw plaintext) data. This is part of an export/dump process for a big data system.

As the gzip module documentation states, you can pass a file-like object to the GzipFile constructor. Since Python is duck-typed, you're free to implement your own stream, like so:

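import sys
from gzip import GzipFile

class MyStream(object):
    def write(self, data):
        # data arrives as compressed bytes; write it to your stream...
        sys.stdout.buffer.write(data)  # stdout, for example

    def flush(self):
        # GzipFile.flush() forwards to this
        sys.stdout.buffer.flush()

# the with-block closes gz, which writes the gzip trailer
with GzipFile(fileobj=MyStream(), mode='wb') as gz:
    gz.write(b"something")  # bytes, not str, on Python 3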
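To tie this back to the CSV case in the question, here is a minimal sketch of the same idea wrapped in a generator, so a web framework or socket loop could send each compressed chunk as it is produced. `ChunkBuffer` and `gzip_csv_stream` are hypothetical names made up for illustration, and `rows` stands in for whatever the API yields. Note that this produces a gzip stream (a `.csv.gz`), not a ZIP archive.

```python
import csv
import gzip
import io


class ChunkBuffer:
    """Write-only file-like object that holds compressed bytes until drained."""

    def __init__(self):
        self._chunks = []

    def write(self, data):
        self._chunks.append(bytes(data))
        return len(data)

    def flush(self):
        # Nothing to do: data is handed out via drain().
        pass

    def drain(self):
        """Return everything written so far and reset the buffer."""
        data = b"".join(self._chunks)
        self._chunks.clear()
        return data


def gzip_csv_stream(rows, filename="Data.csv"):
    """Yield gzip-compressed CSV bytes for an iterable of rows (hypothetical helper)."""
    buf = ChunkBuffer()
    gz = gzip.GzipFile(filename=filename, fileobj=buf, mode="wb")
    # csv.writer produces text, GzipFile wants bytes; TextIOWrapper bridges them.
    text = io.TextIOWrapper(gz, encoding="utf-8", newline="")
    writer = csv.writer(text)
    for row in rows:
        writer.writerow(row)
        chunk = buf.drain()
        if chunk:  # the compressor only emits output once it has enough input
            yield chunk
    text.close()  # flushes the text layer and writes the gzip trailer
    tail = buf.drain()
    if tail:
        yield tail
```

Only the compressed bytes that have not been sent yet are held in memory, so the footprint stays small regardless of how many rows pass through. Usage would look something like `for chunk in gzip_csv_stream(Api.StreamResults()): outputstream.write(chunk)`, assuming each result is a sequence of fields.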

@goncaplopp's answer is great, but you can get more parallelism by running gzip externally, since compression then happens in a separate process while your Python code keeps collecting data. Since you are collecting lots of data, it may be worth the extra effort. On Windows you'll need to find your own compression routine (there are several gzip implementations, and something like 7z may also work). You could also experiment with things like lz that compress more than gzip, depending on what else you need to optimize in your system.

import subprocess as subp
import os

class GZipWriter(object):

    def __init__(self, filename):
        self.filename = filename
        self.fp = None
        self.proc = None

    def __enter__(self):
        self.fp = open(self.filename, 'wb')
        # gzip reads from its stdin and writes compressed output straight to fp
        self.proc = subp.Popen(['gzip'], stdin=subp.PIPE, stdout=self.fp)
        return self

    def __exit__(self, type, value, traceback):
        self.close()
        if type:
            os.remove(self.filename)

    def close(self):
        if self.proc:
            # closing stdin signals EOF; wait so gzip can finish writing
            self.proc.stdin.close()
            self.proc.wait()
            self.proc = None
        if self.fp:
            self.fp.close()
            self.fp = None

    def write(self, data):
        self.proc.stdin.write(data)

with GZipWriter('sometempfile') as gz:
    for i in range(10):
        gz.write(b'a'*80 + b'\n')  # the pipe takes bytes on Python 3
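The example above writes gzip's output to a file. Since the question rules out temp files, the following is a rough sketch of the same external-process idea with gzip's stdout read back through a pipe instead. `external_gzip_stream` is a hypothetical name, and the sketch assumes a `gzip` executable on the PATH that accepts `-c` (write to stdout). A background thread feeds the input, because writing and reading in the same thread can deadlock once the OS pipe buffers fill.

```python
import shutil
import subprocess
import threading


def external_gzip_stream(chunks, read_size=64 * 1024):
    """Yield gzip-compressed bytes from an external gzip process (sketch)."""
    if shutil.which("gzip") is None:
        raise RuntimeError("no gzip executable found on PATH")
    proc = subprocess.Popen(
        ["gzip", "-c"], stdin=subprocess.PIPE, stdout=subprocess.PIPE
    )

    def feed():
        # Push the uncompressed chunks (bytes) into gzip's stdin.
        try:
            for chunk in chunks:
                proc.stdin.write(chunk)
        except BrokenPipeError:
            pass  # gzip went away, e.g. the consumer stopped reading
        finally:
            try:
                proc.stdin.close()  # EOF lets gzip write its trailer and exit
            except BrokenPipeError:
                pass

    feeder = threading.Thread(target=feed, daemon=True)
    feeder.start()
    try:
        while True:
            out = proc.stdout.read(read_size)
            if not out:  # empty read means gzip closed its stdout
                break
            yield out
    finally:
        proc.stdout.close()
        feeder.join()
        proc.wait()
```

Each yielded chunk can be written directly to the response stream. A production version would probably also want to check `proc.returncode` and surface errors raised by the data source inside the feeder thread.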

