
Handling Single File Extraction From Corrupted GZ (TAR)

This is my first post on Stack Overflow. I have a question about extracting a single file from a gzip-compressed TAR archive. I'm not the best at Python, so I may be doing this incorrectly; any help would be much appreciated.


Scenario:

A corrupted *.tar.gz file comes in; the first file in the archive contains the information needed to obtain the serial number (SN) of the system. This can be used to identify the machine so that we can notify its administrator that the file was corrupted.

The Problem:

Using the regular Unix tar binary, I am able to extract just the README file from the archive, even though the archive is incomplete and tar reports an error when extracting it in full. In Python, however, I am unable to extract even that one file; an exception is always raised, even when I specify only that single member.

Current Workaround:

I'm using "os.popen" to use the UNIX tar binary in order to obtain just the README file.

Desired Solution:

To use the Python tarfile module to extract just the single file.

Example Error:

UNIX (Works):

[root@athena tmp]# tar -xvzf bundle.tar.gz README
README

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
[root@athena tmp]# 
[root@athena tmp]# ls
bundle.tar.gz  README

Python:

>>> import tarfile
>>> tar = tarfile.open("bundle.tar.gz")
>>> data = tar.extractfile("README").read()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib64/python2.4/tarfile.py", line 1364, in extractfile
    tarinfo = self.getmember(member)
  File "/usr/lib64/python2.4/tarfile.py", line 1048, in getmember
    tarinfo = self._getmember(name)
  File "/usr/lib64/python2.4/tarfile.py", line 1762, in _getmember
    members = self.getmembers()
  File "/usr/lib64/python2.4/tarfile.py", line 1059, in getmembers
    self._load()        # all members, we first have to
  File "/usr/lib64/python2.4/tarfile.py", line 1778, in _load
    tarinfo = self.next()
  File "/usr/lib64/python2.4/tarfile.py", line 1588, in next
    self.fileobj.seek(self.offset)
  File "/usr/lib64/python2.4/gzip.py", line 377, in seek
    self.read(1024)
  File "/usr/lib64/python2.4/gzip.py", line 225, in read
    self._read(readsize)
  File "/usr/lib64/python2.4/gzip.py", line 273, in _read
    self._read_eof()
  File "/usr/lib64/python2.4/gzip.py", line 309, in _read_eof
    raise IOError, "CRC check failed"
IOError: CRC check failed
>>> print data
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 'data' is not defined

Python (Handling Exception):

>>> tar = tarfile.open("bundle.tar.gz")
>>> try:
...     data = tar.extractfile("README").read()
... except:
...     pass
... 
>>> print(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 'data' is not defined

Using the manual Unix method, gzip decompresses the file up to the point where it breaks.

The Python gzip (and hence tarfile) module bails out as soon as it notices the archive is corrupt, because of the failed CRC check.

Just an idea, but you could pre-process the damaged archives with gzip and re-compress them to correct the CRC:

gunzip < damaged.tar.gz | gzip > corrected.tar.gz

This gives you a corrected.tar.gz that contains all the data up to the point where the archive was broken. You should now be able to use the Python tarfile/gzip modules without getting CRC exceptions.

Keep in mind this command un-gzips and re-gzips the archive, which costs storage I/O and CPU time, so you shouldn't run it on all your archives.

To be efficient, run it only when you actually hit the IOError: CRC check failed exception.
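For example, here is a minimal sketch of that retry logic; the read_readme helper and the .fixed suffix are my own inventions:

import subprocess
import tarfile

def read_readme(path):
    # Try the archive as-is first; fall back to the gunzip | gzip
    # rewrite only when the CRC check actually fails.
    try:
        return tarfile.open(path).extractfile('README').read()
    except IOError:
        fixed = path + '.fixed'  # arbitrary name for the rewritten copy
        subprocess.call('gunzip < %s | gzip > %s' % (path, fixed),
                        shell=True)
        # This can still fail if the break falls inside README itself.
        return tarfile.open(fixed).extractfile('README').read()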

You can do something like this: attempt to decompress the gzip file into an in-memory buffer, then try extracting the magic file from that. In the following example I'm pretty aggressive about trying to read the entire file; depending on the block size of the gzipped data, you can likely get away with reading at most 128-256 KB. My gut tells me gzip works in blocks of at most 64 KB, but I make no promises.

This method does everything in memory, with no intermediate files or writes to disk, but it also keeps the entire decompressed payload in memory, so I'm not joking about fine-tuning this for your specific use case.

#!/usr/bin/python

import gzip 
import tarfile 
import StringIO

# Depending on how your tar file is constructed, you might need to specify 
# './README' as your magic_file

magic_file = 'README'

f = gzip.open('corrupt', 'rb')

# Decompress as much as possible into an in-memory buffer; gzip raises
# its CRC/EOF exception once it reaches the damaged region.
t = StringIO.StringIO()

try:
    while 1:
        block = f.read(1024)
        if not block:   # clean EOF -- nothing more to decompress
            break
        t.write(block)
except Exception, e:    # "except Exception as e" needs Python >= 2.6
    print str(e)
    print '%d bytes decompressed' % t.tell()

# Treat whatever was recovered as a plain (uncompressed) tar stream.
t.seek(0)
tarball = tarfile.TarFile.open(name=None, mode='r', fileobj=t)

try:
    # extractfile() returns a file-like object for the member, so
    # read() yields the file's actual contents (tobuf() would only
    # return the raw tar header block).
    magic_data = tarball.extractfile(magic_file).read()

    # search magic_data for the serial number, or print out the file
    print magic_data
except Exception, e:
    print e
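If the serial number carries a predictable label inside README, the final step could instead pull it out directly. This continuation is purely illustrative; the "Serial Number:" pattern is a guess and must be adjusted to the real file format:

import re

# Hypothetical: assumes README contains a line such as
# "Serial Number: ABC123". Adjust the pattern to the actual format.
match = re.search(r'Serial Number:\s*(\S+)', magic_data)
if match:
    print 'SN: %s' % match.group(1)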
