简体   繁体   中英

Knowing current compressed file size using gzwrite (zlib)

I'm using zlib for c++.

Quote from http://refspecs.linuxbase.org/LSB_3.0.0/LSB-PDA/LSB-PDA/zlib-gzwrite-1.html regarding gzwrite function:

The gzwrite() function shall write data to the compressed file referenced by file , which shall have been opened in a write mode (see gzopen() and gzdopen() ). On entry, buf shall point to a buffer containing len bytes of uncompressed data. The gzwrite() function shall compress this data and write it to file. The gzwrite() function shall return the number of uncompressed bytes actually written.

I interpret this as the return value will NOT tell me how much larger the file became when writing. Only how much data was compressed into the file.

The only way to know how large the file is would then be to close it, and read the size from the file system. I have a requirement to only continue to write to the file until it reaches a certain size. Can this be achieved without closing the file?

A workaround would be to write until the uncompressed size reaches my limit and then close the file, read the size from file system and update my best guess of file size based on that, and then re-open the file and continue writing. This would make me close and open the file a few times towards the end (as I'm approaching the size limit).

Another workaround, which would give more of an estimate (which is not what I want really) would be to write until uncompressed size reaches the limit, close the file, read the file size from the file system and calculate the compression ratio so far. The I can use this compression ratio to calculate a new limit for uncompressed file size where the compression should get me down to the limit for the compressed file size. If I repeat this the estimate would improve, but again, not what I'm looking for.

Are there better options?

Preferred option would be if zlib could tell me the compressed file size while the file is still open. I don't see why this information would not be available inside zlib at this point, since compression happens when I call gzwrite and not when i close the file.

zlib provides the function gzoffset() , which does exactly what you're asking.

If for some reason you are stuck with a version of zlib that is more than about eight years old, when gzoffset() was added, then this is easy to do with gzdopen() . You open the output file with fopen() or open() , and provide the file descriptor (using fileno() and dup() if you used fopen() ), and then provide that descriptor to gzdopen() . Then you can use ftell() or lseek() at any time to see how much as been written. Be careful to not try to double-close the descriptor. See the comments for gzdopen() .

You can work around this issue by using a pipe. The idea is to write the compressed data into a pipe. After that, you read the data from the other end of the pipe, count it and write it to the actual file.

To set this up you need to first open the file to write to via a simple open . Then create a pipe via pipe2 and initialize zlib by passing one of the pipe descriptors to gzdopen :

int out = open("/path/to/file", O_WRONLY | O_CREAT | O_TRUNC);
int p[2];
pipe2(p, O_NONBLOCK);
gzFile zFile = gzdopen(p[0], "w");

You can now write the data first to the pipe and then splice it from the pipe to the out file:

gzwrite(zFile, buf, 1024); //or any other length
size_t bytesWritten = 0;
do {
    bytesWritten = splice(p[1], NULL, out, NULL, 1024, SPLICE_F_NONBLOCK | SPLICE_F_MORE);
} while(bytesWritten == 1024);

As you can see, you now have the bytesWritten to tell you how much data was actually written. Simply sum it up in another variable and stop splicing as soon as you have written as much data as you need to (or just splice it in one go by writing everything to the zFile and the splice once with the amount of data you are allowed to store as the fifth parameter. If you want to not compress uneccessary data, simply do it in chunks as shown above).

A note on splice: Splice is linux specific, and is basically just a very efficient copy. You can always replace it with a simple "read and write" combo, ie read data from fd[1] into a buffer and then write the data from that buffer into out - splice is just faster and less code.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM