This is my first post on Stack Overflow. I have a question about extracting a single file from a gzip-compressed TAR file. I'm not the best at Python, so I may be doing this incorrectly; any help would be much appreciated.
Scenario:
A corrupted *.tar.gz file comes in. The first file in the archive contains important information for obtaining the serial number (SN) of the system. This can be used to identify the machine so that we can notify its administrator that the file was corrupted.
The Problem:
Using the regular UNIX tar binary, I am able to extract just the README file from the archive, even though the archive is incomplete and extracting it fully returns an error. In Python, however, I am unable to extract just that one file; an exception is always raised, even when I specify only the single file.
Current Workaround:
I'm using os.popen to invoke the UNIX tar binary and obtain just the README file.
Desired Solution:
To use the Python tarfile module to extract just the single file.
Example Error:
UNIX (Works):
[root@athena tmp]# tar -xvzf bundle.tar.gz README
README
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
[root@athena tmp]#
[root@athena tmp]# ls
bundle.tar.gz README
Python:
>>> import tarfile
>>> tar = tarfile.open("bundle.tar.gz")
>>> data = tar.extractfile("README").read()
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/lib64/python2.4/tarfile.py", line 1364, in extractfile
tarinfo = self.getmember(member)
File "/usr/lib64/python2.4/tarfile.py", line 1048, in getmember
tarinfo = self._getmember(name)
File "/usr/lib64/python2.4/tarfile.py", line 1762, in _getmember
members = self.getmembers()
File "/usr/lib64/python2.4/tarfile.py", line 1059, in getmembers
self._load() # all members, we first have to
File "/usr/lib64/python2.4/tarfile.py", line 1778, in _load
tarinfo = self.next()
File "/usr/lib64/python2.4/tarfile.py", line 1588, in next
self.fileobj.seek(self.offset)
File "/usr/lib64/python2.4/gzip.py", line 377, in seek
self.read(1024)
File "/usr/lib64/python2.4/gzip.py", line 225, in read
self._read(readsize)
File "/usr/lib64/python2.4/gzip.py", line 273, in _read
self._read_eof()
File "/usr/lib64/python2.4/gzip.py", line 309, in _read_eof
raise IOError, "CRC check failed"
IOError: CRC check failed
>>> print data
Traceback (most recent call last):
File "<stdin>", line 1, in ?
NameError: name 'data' is not defined
Python (Handling Exception):
>>> tar = tarfile.open("bundle.tar.gz")
>>> try:
... data = tar.extractfile("README").read()
... except:
... pass
...
>>> print(data)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
NameError: name 'data' is not defined
Using the manual UNIX method, it looks like gzip decompresses the file up to the point where it breaks.
The Python gzip (and therefore tarfile) module bails out as soon as it notices the archive is corrupt, due to the failed CRC check.
Just an idea, but you could pre-process the damaged archives and re-compress them to correct the CRC:
gunzip < damaged.tar.gz | gzip > corrected.tar.gz
This will give you a corrected.tar.gz containing all the data up to the point where the archive was broken. You should now be able to use the Python tarfile/gzip libraries without getting CRC exceptions.
Keep in mind this command un-gzips and re-gzips the archive, which costs storage I/O and CPU time, so you shouldn't do it for all your archives.
To be efficient, only run it when you actually get the IOError: CRC check failed exception.
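If you want to drive that repair step from Python rather than the shell, a minimal sketch could shell out to gzip itself, which tolerates the truncation the same way the tar transcript above does. The function name and paths here are my own invention, and this assumes a Unix-like system with gzip on the PATH and Python 3:

```python
import subprocess
import tarfile

def repair_and_open(damaged_path, fixed_path):
    """Re-compress a damaged .tar.gz so its gzip CRC trailer is valid
    again, keeping whatever data gzip can still recover, then open it.
    Function name and paths are illustrative, not from the question."""
    with open(fixed_path, "wb") as out:
        # Equivalent of: gunzip < damaged.tar.gz | gzip > corrected.tar.gz
        gunzip = subprocess.Popen(["gzip", "-dc", damaged_path],
                                  stdout=subprocess.PIPE,
                                  stderr=subprocess.DEVNULL)
        subprocess.run(["gzip", "-c"], stdin=gunzip.stdout, stdout=out)
        gunzip.stdout.close()
        gunzip.wait()  # a non-zero exit status is expected for a damaged file
    return tarfile.open(fixed_path, "r:gz")
```

Note that the corrected archive is still truncated as a tar stream, so avoid getmembers(); pull members one at a time with tar.next() and stop once you have the README.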
You can do something like this: attempt to decompress the gzip file into a temporary buffer, then try extracting the magic file from that. In the following example I'm pretty aggressive about trying to read the entire file; depending on the block size of the gzipped data, you can likely get away with reading at most 128-256k. My gut tells me that gzip works in blocks of at most 64k, but I make no promises.
This method does everything in memory without intermediate files or writes to disk, but it also keeps the entire decompressed payload in memory, so I'm not joking about fine-tuning this for your specific use case.
#!/usr/bin/python
import gzip
import tarfile
import StringIO

# Depending on how your tar file is constructed, you might need to specify
# './README' as your magic_file
magic_file = 'README'

f = gzip.open('corrupt', 'rb')
t = StringIO.StringIO()

# Pull as much decompressed data as we can out of the damaged archive;
# the failed CRC check surfaces as an exception once we hit the corruption.
try:
    while 1:
        block = f.read(1024)
        if not block:
            break
        t.write(block)
except Exception as e:
    print str(e)

print '%d bytes decompressed' % (t.tell())
t.seek(0)
tarball = tarfile.TarFile.open(name=None, mode='r', fileobj=t)

try:
    # extractfile() returns a file-like object for the member,
    # so we can read the README's contents directly
    magic_data = tarball.extractfile(magic_file).read()
    # search magic_data for the serial number, or print out the file
    print magic_data
except Exception as e:
    print e
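On modern Python 3 the same in-memory idea can be written without gzip.open at all, by feeding the raw bytes through zlib in small chunks so a failed CRC check or a truncated stream only loses the final chunk instead of aborting the whole read. This is a sketch under that assumption; the function names are my own:

```python
import io
import tarfile
import zlib

def salvage_gzip(data, chunk=1024):
    """Decompress as much of a damaged gzip byte string as possible.

    Feeding zlib small chunks means an error at the corruption point
    only discards that chunk; everything decoded before it is kept.
    """
    # wbits = 32 + MAX_WBITS tells zlib to expect a gzip header
    d = zlib.decompressobj(32 + zlib.MAX_WBITS)
    out = bytearray()
    for i in range(0, len(data), chunk):
        try:
            out += d.decompress(data[i:i + chunk])
        except zlib.error:  # truncated stream or failed CRC check
            break
    return bytes(out)

def read_first_member(gz_bytes):
    """Return (name, contents) of the first file in a damaged .tar.gz."""
    partial_tar = salvage_gzip(gz_bytes)
    tar = tarfile.open(fileobj=io.BytesIO(partial_tar), mode="r:")
    info = tar.next()  # first member, e.g. the README
    return info.name, tar.extractfile(info).read()
```

As with the StringIO version, the salvaged tar stream is truncated, so don't call getmembers(); walk members with tar.next() and stop as soon as you have the file you need.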