I have made a module that detects the encoding of a file. I want to be able to able to give file path and encoding as inputs to the class and always be able to get back 'utf-8' when I process the contents of the file.
For example something like this
handler = UnicodeWrapper(file_path, encoding='ISO-8859-2')
for line in handler:
# need the line to be encoded in utf-8
process(line)
I can not understand why there are a million types of encodings yet. But I want to write an interface that always returns unicode.
Is there a library to do this already?
Based on this answer , I think the following might suit your needs:
import io
class UnicodeWrapper(object):
def __init__(self, filename):
self._filename = filename
def __iter__(self):
with io.open(self._filename,'r', encoding='utf8') as f:
return iter(f.readlines())
if __name__ == '__main__':
filename = r'...'
handler = UnicodeWrapper(filename)
for line in handler:
print(line)
Edit
In Python 2, you can assert that each line is encoded in UTF-8 using something like this:
if __name__ == '__main__':
filename = r'...'
handler = UnicodeWrapper(filename)
for line in handler:
try:
line.decode('utf-8')
# process(line)
except UnicodeDecodeError:
print('Not encoded in UTF-8')
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.