
Reading a (presumably) unicode file in python

When I use simple I/O calls to read a particular file on my system, such as:

f = open('file.ini')
for line in f.readlines():
    print line

I'm getting output such as this:

 H E L L O !  W H Y  A R E  T H E R E  S O  M A N Y  S P A C E S ?

I presume it's Unicode, but I can't quite figure out how to read it as Unicode or convert it to ASCII. Suggestions?

Try opening the file using codecs to make things easier.

Example:

import codecs
f = codecs.open('file.ini', encoding='utf-16-le')  # You can experiment with different encodings
for line in f:  # note, the readlines is not really needed
    print line,  # the trailing comma stops print from adding a second newline (each line already ends with one)

PS: if you don't know the encoding, I recommend looking at this question: Determine the encoding of text in Python

Very regular spaces are usually an indicator that your data is encoded in UTF-16. What you usually see is that every second byte is a 0 byte. You can confirm this by printing out the actual binary data that you are reading:

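f = open('file.ini', 'rb')  # binary mode, so the bytes come back untranslated
line = f.readline()
print map(ord, line)  # the numeric value of every byte in the first line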

If you see output like this:

[..., 68, 0, 65, 0, 76, 0, 76, 0, 79, ...]

Then that's almost certainly the case.

The trick, then, is to figure out whether it's the even bytes that are 0s, or the odd bytes. There are two UTF-16 encodings: Big-endian and little-endian, named for the significance of the byte that comes first. If your 0s come before the character that they are associated with, then the file is big-endian, and you can open it like this (Python 3.x):

f = open('file.ini', encoding='utf-16-be')

In Python 2.x, import the codecs module to do this:

import codecs
f = codecs.open('file.ini', encoding='utf-16-be')

If the 0s come after, then substitute 'utf-16-le'.
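The even/odd check described above can be automated. The following is only a rough sketch, written for Python 3, and it assumes the text is mostly ASCII so that one byte of each pair is 0. The function name guess_utf16_byte_order is made up for this illustration and is not part of any library.

```python
def guess_utf16_byte_order(raw):
    """Guess the byte order of mostly-ASCII UTF-16 data from where the 0 bytes fall.

    Hypothetical helper: returns 'utf-16-be', 'utf-16-le', or None when
    neither position clearly holds more zeros.
    """
    data = bytearray(raw)  # iterating a bytearray yields integers
    even_zeros = sum(1 for b in data[0::2] if b == 0)
    odd_zeros = sum(1 for b in data[1::2] if b == 0)
    if even_zeros > odd_zeros:
        return 'utf-16-be'  # the 0 comes before its character
    if odd_zeros > even_zeros:
        return 'utf-16-le'  # the 0 comes after its character
    return None


sample = u'HELLO'.encode('utf-16-le')
print(guess_utf16_byte_order(sample))  # prints utf-16-le
```

To try it on a real file, read a chunk in binary mode (open('file.ini', 'rb').read(200), for instance) and pass that in. A None result suggests the data is not ASCII-heavy UTF-16.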

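(Make sure that you decode the file as you read it, or read the entire contents into memory before decoding. You definitely do not want to split the lines apart before you decode: in UTF-16 a newline is two bytes, so splitting on the single newline byte leaves a stray 0 byte behind and every following line is off by one.)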
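To see why the order matters, here is a small Python 3 sketch on made-up sample data. The two function names are invented for this illustration. It compares splitting the raw bytes first against decoding first.

```python
def lines_split_then_decode(raw, encoding):
    # Wrong order: cut the raw bytes at each newline byte, then decode each piece.
    # errors='replace' is used only so the mis-decoded pieces can be inspected
    # instead of raising UnicodeDecodeError.
    return [piece.decode(encoding, 'replace') for piece in raw.split(b'\n')]


def lines_decode_then_split(raw, encoding):
    # Right order: decode everything, then split the resulting text.
    return raw.decode(encoding).splitlines()


raw = u'ab\ncd\n'.encode('utf-16-le')

good = lines_decode_then_split(raw, 'utf-16-le')
bad = lines_split_then_decode(raw, 'utf-16-le')

print(good)               # prints ['ab', 'cd']
print(bad[1] == u'cd')    # prints False: the second piece starts with a stray 0 byte
```

The first piece still decodes correctly, which is why the problem can look as if it only affects "some" lines.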

If you're lucky, the file was written with a Byte Order Mark (BOM) at the beginning. This character is U+FEFF: if the first two bytes are [254, 255], then the encoding is big-endian, and if they are [255, 254], then it is little-endian.
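The BOM check can be written with the constants from the standard codecs module. This is a minimal Python 3 sketch. The helper name utf16_codec_from_bom is invented here, and it only looks for the two UTF-16 marks.

```python
import codecs


def utf16_codec_from_bom(first_bytes):
    """Return the UTF-16 codec name implied by a leading BOM, or None if there is none."""
    if first_bytes.startswith(codecs.BOM_UTF16_BE):   # bytes 254, 255
        return 'utf-16-be'
    if first_bytes.startswith(codecs.BOM_UTF16_LE):   # bytes 255, 254
        return 'utf-16-le'
    return None


# Made-up sample: a little-endian BOM followed by little-endian text.
sample = codecs.BOM_UTF16_LE + u'hi'.encode('utf-16-le')
print(utf16_codec_from_bom(sample))   # prints utf-16-le
print(sample.decode('utf-16'))        # prints hi
```

As the last line shows, when a BOM is present the plain 'utf-16' codec reads it, picks the byte order, and drops the mark from the decoded text, so passing encoding='utf-16' when opening the file is often enough.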

If none of those apply, then you might not be looking at UTF-16 data, and you'll have to do some more research to figure out what encoding you're looking at.

