Generic way to open (possibly gzipped) file with specific text encoding in Python

Question

I am writing code that opens a (possibly gzipped) text file, and the code needs to work in both Python 2 and Python 3.

If I only had plain (uncompressed) text files, I could do:

import io
for line in io.open(file_name, encoding='some_encoding'):
    pass

If I did not care about decoding (using strings / bytes in Python 2/3), I could do:

import gzip

if file_name.endswith('.gz'):
    file_obj = gzip.open(file_name)
else:
    file_obj = open(file_name)

for line in file_obj:
    pass

How can I handle both of these cases cleanly? In other words, how do I integrate decoding with gzip.open()?

Answer 1

I tested this briefly and it seems to do the right thing. gzip.GzipFile accepts a file object and io.open accepts a file descriptor, so:

import io
import gzip

f_obj = open('file.gz','r')
io_obj = io.open(f_obj.fileno(), encoding='UTF-8')
gzip_obj = gzip.GzipFile(fileobj=io_obj, mode='r')
gzip_obj.read()

That gives me a UnicodeDecodeError because the file I'm reading isn't actually UTF-8, so it would appear to be doing the right thing.

For some reason, if I use io.open to open file.gz directly, gzip says that the file is not a compressed file.

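UPDATE: Yeah, that was silly; the streams are the wrong way around to begin with. The snippet above decodes the still-compressed bytes, which is why it fails; the data has to be decompressed first and decoded second.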
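
To illustrate the ordering problem without touching the disk, here is a minimal in-memory sketch. The variable names (text, compressed, decode_failed, decoded) are mine and only for illustration. It assumes the payload is UTF-8 text, and it relies on the fact that a gzip stream starts with the bytes 0x1f 0x8b, which is not valid UTF-8.

```python
# -*- coding: utf-8 -*-
import gzip
import io

# Sample text (assumed UTF-8 for this illustration).
text = u'\xf6\n\xe4\nu\ny'

# Compress the encoded text into an in-memory buffer.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(text.encode('utf-8'))
compressed = buf.getvalue()

# Wrong order: decoding the compressed bytes directly.
try:
    compressed.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Right order: decompress first, then decode the resulting bytes.
with gzip.GzipFile(fileobj=io.BytesIO(compressed), mode='rb') as gz:
    decoded = gz.read().decode('utf-8')
```

So the UnicodeDecodeError in my first attempt said nothing about the file's real encoding; it came from decoding gzip data as text.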

Test file (UTF-8 encoded, then gzipped to file.gz):

ö
ä
u
y

The following code decodes the compressed file with the given codec:

import codecs
import gzip

gz_fh = gzip.open('file.gz')  # binary stream of decompressed bytes
ascii_reader = codecs.getreader('ASCII')  # named to avoid shadowing the ascii() builtin
utf8_reader = codecs.getreader('UTF-8')

ascii_fh = ascii_reader(gz_fh)
ascii_fh.readlines()
# -> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

gz_fh.seek(0)  # the failed read consumed the stream, so rewind before reading again
utf8_fh = utf8_reader(gz_fh)
utf8_fh.readlines()
# -> [u'\xf6\n', u'\xe4\n', u'u\n', u'y']

A codecs.StreamReader wraps any binary stream, so you should be able to pass it either the gzip file object or a regular file opened in binary mode.

http://docs.python.org/library/codecs.html#codecs
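
Putting this together for the original question, here is a rough sketch of a helper that picks the binary stream by file extension and wraps it in a StreamReader. The name open_text and its default encoding are hypothetical, and detecting gzip by the '.gz' suffix is an assumption carried over from the question. I have not tested it beyond small files.

```python
import codecs
import gzip


def open_text(file_name, encoding='utf-8'):
    """Hypothetical helper: return a decoding reader for a plain or gzipped file.

    Assumes gzipped files can be recognised by a '.gz' suffix.
    """
    if file_name.endswith('.gz'):
        binary_fh = gzip.open(file_name, 'rb')  # decompressed bytes
    else:
        binary_fh = open(file_name, 'rb')       # raw bytes
    # Wrap the binary stream so that reads return decoded text.
    return codecs.getreader(encoding)(binary_fh)
```

The returned reader can be iterated line by line (for line in open_text(file_name, 'some_encoding'): ...) and closed with close(). If you only need Python 3.3 or later, gzip.open(file_name, 'rt', encoding='some_encoding') does the decoding directly, but that does not help on Python 2.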

1 answer

solution1
1 ACCEPTED 2012-09-19 10:33:12