
Reading *.lz4 files in Python

I have a large number of tweet data files compressed in LZ4 format. I'd like to open each file, decompress it, and extract some information in Python.

When I decompress a file using the lz4c -d command in Ubuntu, it decompresses just fine. But when I use lz4.loads('path_to_file') in Python, it fails with ValueError: corrupt input at byte 6. The same error occurs when I read() the file in binary mode. What should I do?

Either prefix your compressed data with the size of the uncompressed data or try upgrading to a later version of the python-lz4 package which has a nicer way of specifying the uncompressed data size.

Either way you need to know the size of the uncompressed data up front.

Note that if you are just decompressing what you just compressed, it will just work since the compressor prefixes the compressed data with its uncompressed size.
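As a minimal sketch of the first option, the helpers below prepend and strip a size header using only the standard library. They assume the convention used in the session further down (a 4-byte little-endian unsigned integer in front of the compressed block); add_size_prefix and split_size_prefix are hypothetical names, not part of python-lz4.

```python
import struct


def add_size_prefix(compressed, uncompressed_size):
    """Prepend the uncompressed size as a 4-byte little-endian header.

    Assumption: this matches the header layout that the older
    lz4.decompress()/lz4.loads() calls expect.
    """
    return struct.pack('<I', uncompressed_size) + compressed


def split_size_prefix(block):
    """Return (uncompressed_size, compressed_payload) from a prefixed block."""
    (size,) = struct.unpack_from('<I', block)
    return size, block[4:]
```

The result of add_size_prefix(data, size) is what would be handed to the older decompress call; split_size_prefix is useful for inspecting a block produced by the compressor.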

Read on for details of my particular case ...

I am using Ubuntu 16.04.1 LTS and found that neither the standard python-lz4 package nor the version installed with pip provided a sensible working version of the Python lz4 package.

I say sensible because the decompress method in those versions needs the exact size of the decompressed message, and that size has to be prefixed to the actual data:

Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lz4
>>> x = '\xb3\x1a\x00\x10\x005\x08\x00\x00\x00\x00\xff\x01\x00\x80\xf7\xae\xe9\x8fP\x8b\xa5\x14\x1a\x00\x196\x1a\x00\x80\x19\xbd\xe9\x8fP\x8b\xa5\x14'
>>> from struct import *
>>> len(x)
38
>>> # Guess 50 for the size of the uncompressed string ??
... 
>>> block = pack('<I', 50) + x
>>> y = lz4.decompress(block)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: corrupt input at byte 31
>>> # Try a bigger value
...
>>> block = pack('<I', 8192) + x
>>> y = lz4.decompress(block)
>>> len(y)
8192

But now lz4.decompress always returns exactly as many bytes as I guessed, which means that I cannot determine the actual size of the decompressed data.

I resorted to cloning python-lz4 from https://github.com/python-lz4/python-lz4, building it, and then using the resulting Python package, which gave me an improvement:

Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lz4
>>> x = '\xb3\x1a\x00\x10\x005\x08\x00\x00\x00\x00\xff\x01\x00\x80\xf7\xae\xe9\x8fP\x8b\xa5\x14\x1a\x00\x196\x1a\x00\x80\x19\xbd\xe9\x8fP\x8b\xa5\x14'
>>> # I know that the decompressed data will never be greater than 8192 bytes
...
>>> lz4.block.decompress(x, 8192)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: Decompressor wrote 52 bytes, but 8192 bytes expected from header
>>> # Now I know the size required, albeit not programmatically, so ...
...
>>> lz4.block.decompress(x, 52)
'\x1a\x00\x10\x005\x08\x00\x00\x00\x00\xff\xff\xff\xff\xff\xff\xff\xff\xf7\xae\xe9\x8fP\x8b\xa5\x14\x1a\x00\x10\x006\x08\x00\x00\x00\x00\xff\xff\xff\xff\xff\xff\xff\xff\x19\xbd\xe9\x8fP\x8b\xa5\x14'

So the latest version of this package takes the size of the uncompressed data as a parameter and it can tell me the actual size, but only in an exception message.

Looking under the hood, the call to the lz4 C library made from the python-lz4 library actually succeeds when you give it a decompressed size greater than necessary but python-lz4 chooses to throw an exception when the two don't match.

I don't know the background behind that decision, but in my case when I don't know the decompressed data size up front, this is not yet fully useful.
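One possible workaround when the size is unknown is to retry with a growing size bound. The sketch below is hypothetical: decompress_unknown_size is my own helper name, and it assumes a python-lz4 release in which lz4.block.decompress treats uncompressed_size as an upper bound and returns fewer bytes when the data is smaller (I believe later releases behave this way, but check the installed version). The decompress callable is injectable so the retry logic can be exercised without the library.

```python
def decompress_unknown_size(block, decompress=None, start_size=64,
                            max_size=1 << 26):
    """Decompress a raw LZ4 block whose uncompressed size is not known.

    Hypothetical helper: tries successively larger size bounds until the
    decompress call succeeds or max_size is exceeded.
    """
    if decompress is None:
        # Assumption: python-lz4 is installed and its block API accepts
        # uncompressed_size as a maximum rather than an exact size.
        import lz4.block
        decompress = lz4.block.decompress

    size = start_size
    while size <= max_size:
        try:
            return decompress(block, uncompressed_size=size)
        except Exception:
            # Bound too small (or rejected by this release); double and retry.
            # Note that genuinely corrupt input also ends up here.
            size *= 2
    raise ValueError("could not decompress within %d bytes" % max_size)
```

Catching every exception is deliberately crude, since the exception type differs between releases; with corrupt input the loop simply runs until max_size and then raises.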

The python-lz4 package contains bindings for both the block and the frame APIs of the LZ4 library. The deprecated loads method is meant for reading in a raw block of LZ4 compressed data. That probably isn't what you want to do - the LZ4 files will be compressed using the frame format.
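A quick way to tell whether a file uses the frame format is to look at its first four bytes: a standard LZ4 frame starts with the magic number 0x184D2204, stored little-endian. The sketch below uses only the standard library; has_frame_magic and file_has_frame_magic are hypothetical helper names, and the check does not cover legacy or skippable frames.

```python
import struct

# Magic number at the start of a standard LZ4 frame (stored little-endian).
LZ4_FRAME_MAGIC = 0x184D2204


def has_frame_magic(header):
    """Return True if the given bytes begin with the LZ4 frame magic number."""
    if len(header) < 4:
        return False
    (magic,) = struct.unpack_from('<I', header)
    return magic == LZ4_FRAME_MAGIC


def file_has_frame_magic(path):
    """Read the first four bytes of the file at `path` and check them."""
    with open(path, 'rb') as f:
        return has_frame_magic(f.read(4))
```

If the check returns True, the file is a frame and belongs with lz4.frame rather than the raw block functions.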

As of version 0.19.1 the python lz4 package has full support for reading LZ4 compressed files with buffering, like this:

import lz4.frame
chunk_size = 128 * 1024 * 1024
with lz4.frame.open('mybigfile.lz4', 'r') as file:
    chunk = file.read(size=chunk_size)
    # Do stuff with this chunk of data.

which allows you to read the file in and process it in chunks. That prevents the need to hold the full file in memory, or decompress the whole file to disk. On the other hand if you do want to slurp the full file in, simply leave size unspecified in the call to .read() above.
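To process the whole file rather than a single chunk, the read can be wrapped in a loop. This is a minimal sketch: iter_chunks and count_decompressed_bytes are hypothetical helpers, and the second one assumes python-lz4 0.19.1 or later is installed so that lz4.frame.open is available.

```python
def iter_chunks(file_obj, chunk_size=1024 * 1024):
    """Yield successive chunks from a binary file-like object until EOF."""
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            break
        yield chunk


def count_decompressed_bytes(path, chunk_size=1024 * 1024):
    """Example consumer: total decompressed size of an LZ4 frame file."""
    # Imported here so that iter_chunks stays usable without python-lz4.
    import lz4.frame

    total = 0
    with lz4.frame.open(path, 'rb') as f:
        for chunk in iter_chunks(f, chunk_size):
            total += len(chunk)
    return total
```

Only one chunk is held in memory at a time; replace the len(chunk) accumulation with whatever per-chunk processing is needed.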

More information can be found in the documentation.

Aside: I am the maintainer of the python lz4 bindings, so if you hit problems, or the docs are unclear, please do file an issue at the project page.

lz4.loads() decompresses the string you pass to it and not the file path in that string. It doesn't seem like this library supports opening files, so you have to read the data yourself.

lz4.loads(open('path_to_file', 'rb').read())

Try with the lz4tools package instead: https://pypi.python.org/pypi/lz4tools

My test fails with lz4:

>>> lz4.loads(open("test.js.lz4","rb").read())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: corrupt input at byte 10

But it works with lz4tools:

>>> lz4tools.open("test.js.lz4").read()
'[{\n    "cc_emails": [],\n    "fwd_emails": [],\n    "reply_cc_emails": [],\n    "fr_escalated": false,\n    "spam": false,\n    "emai.....
