
Generic way to open a (possibly gzipped) file with a specific text encoding in Python

I am writing a piece of code that opens a (possibly gzipped) text file and has to work in both Python 2 and Python 3.

If I only had plain (uncompressed) text files, I could do:

import io
for line in io.open(file_name, encoding='some_encoding'):
    pass  # process the decoded line

If I did not care about decoding (using strings / bytes in Python 2/3), I could do:

import gzip

if file_name.endswith('.gz'):
    file_obj = gzip.open(file_name)
else:
    file_obj = open(file_name)

for line in file_obj:
    pass  # process the undecoded line

How can I handle both of these cases cleanly? In other words, how do I integrate decoding with gzip.open()?

I tested this briefly and it seems to do the right thing. You can pass a file object to both gzip.GzipFile and io.open, so:

import io
import gzip

f_obj = open('file.gz','r')
io_obj = io.open(f_obj.fileno(), encoding='UTF-8')
gzip_obj = gzip.GzipFile(fileobj=io_obj, mode='r')

That gives me a UnicodeDecodeError because the file I'm reading isn't actually UTF-8, so it would appear to be doing the right thing.

For some reason, if I use io.open to open file.gz directly, gzip says that the file is not a compressed file.

UPDATE: Yeah, that's silly; the streams are the wrong way around to begin with. The data has to be decompressed first and decoded second, so the decoder must wrap the gzip stream, not the other way around.
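The corrected ordering (decompress first, then decode) might look roughly like this. It is a minimal sketch that assumes Python 3, where gzip.open returns a binary file object that io.TextIOWrapper accepts; the helper name read_gzipped_text and the temporary sample file are hypothetical and only there to make the snippet self-contained.

```python
import gzip
import io
import os
import tempfile


def read_gzipped_text(path, encoding):
    """Return the decoded lines of a gzip-compressed text file (hypothetical helper)."""
    # Inner layer: gzip.open yields the decompressed bytes.
    with gzip.open(path, 'rb') as binary_stream:
        # Outer layer: TextIOWrapper decodes those bytes into text.
        with io.TextIOWrapper(binary_stream, encoding=encoding) as text_stream:
            return text_stream.readlines()


# Build a small UTF-8 sample file so the sketch can run on its own.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.gz')
with gzip.open(sample_path, 'wb') as fh:
    fh.write(u'\xf6\n\xe4\nu\ny'.encode('UTF-8'))

lines = read_gzipped_text(sample_path, 'UTF-8')
```

On Python 3.3 and later, gzip.open(path, 'rt', encoding='UTF-8') does the same wrapping in a single call.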

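Test file (file.gz, gzip-compressed UTF-8 text; judging by the decoded output shown further down, its four lines are ö, ä, u and y).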


The following code decodes the compressed file with the chosen codec:

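import codecs
import gzip

gz_fh = gzip.open('file.gz')  # binary stream of decompressed bytes
ascii_reader = codecs.getreader('ASCII')
utf8_reader = codecs.getreader('UTF-8')

ascii_fh = ascii_reader(gz_fh)
ascii_fh.readlines()
-> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

gz_fh.seek(0)  # rewind, since the failed read above consumed the stream
utf8_fh = utf8_reader(gz_fh)
utf8_fh.readlines()
-> [u'\xf6\n', u'\xe4\n', u'u\n', u'y']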

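codecs.StreamReader wraps a byte stream, so you should be able to pass either the gzip file object or an uncompressed file object to it. For the uncompressed case the file should be opened in binary mode ('rb'), so that the reader receives bytes on Python 3 as well.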
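Putting the pieces together, a helper along these lines might cover both cases on Python 2 and Python 3. This is only a sketch: open_text is a hypothetical name, the '.gz' suffix check is carried over from the question, and the sample files are created purely for demonstration.

```python
import codecs
import gzip
import os
import tempfile


def open_text(file_name, encoding):
    """Open a plain or gzipped file and return a decoding reader (hypothetical helper)."""
    if file_name.endswith('.gz'):
        byte_stream = gzip.open(file_name, 'rb')
    else:
        byte_stream = open(file_name, 'rb')
    # codecs.getreader returns the StreamReader class for the codec;
    # instantiating it wraps the byte stream and decodes on read.
    return codecs.getreader(encoding)(byte_stream)


# Self-contained demonstration with two temporary files.
payload = u'\xf6\n\xe4\nu\ny'.encode('UTF-8')
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, 'sample.txt')
gz_path = os.path.join(tmp_dir, 'sample.txt.gz')
with open(plain_path, 'wb') as fh:
    fh.write(payload)
with gzip.open(gz_path, 'wb') as fh:
    fh.write(payload)

# Leaving the with block closes the reader and its underlying byte stream.
with open_text(plain_path, 'UTF-8') as reader:
    plain_lines = reader.readlines()
with open_text(gz_path, 'UTF-8') as reader:
    gz_lines = reader.readlines()
```

The calling code never needs to know whether the file was compressed; it only sees decoded text. Iterating with for line in reader should work as well.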


