Character Encoding, XML, Excel, python

Question

I am reading a list of strings that were imported into an Excel XML file from another software program. I am not sure what the encoding of the Excel file is, but I am pretty sure it's not windows-1252, because when I try to use that encoding, I get a lot of errors.

The specific word that is causing me trouble right now is: "Zmysłowska, Magdalena" (notice the "l" is not a standard "l", but rather, has a slash through it).

I have tried a few things; I'll mention three of them here:

(1)

page = unicode(page, "utf-8")
page = unicodedata.normalize("NFKD", page)
page = page.encode("utf-8", "ignore")

Output: Zmys\xc5\x82owska, Magdalena
Output after print statement: ZmysÅ‚owska, Magdalena

(2)

page = unicode(page, "utf-8")
page = unicodedata.normalize("NFKD", page)

Output: Zmys\u0142owska, Magdalena
Output after print statement: Zmysłowska, Magdalena

Note: this is great, but I need to encode it back to utf-8 before putting the string into my db. When I do that, by running page.encode("utf-8", "ignore"), I end up with ZmysÅ‚owska, Magdalena again.

(3) Do nothing (no normalization, no decode, no encode). It seems like the string is already utf-8 when it comes in. However, when I do nothing, the string ends up with the following output again:

Output: Zmys\xc5\x82owska, Magdalena
Output after print statement: ZmysÅ‚owska, Magdalena

Is there a way for me to convert this string to utf-8?

Answer 1

Your problem isn't your encoding and decoding. Your code correctly takes a UTF-8 string, and converts it to an NFKD-normalized UTF-8 string. (You might want to use page.decode("utf-8") instead of unicode(page, "utf-8") just for future-proofing in case you ever go to Python 3, and to make the code a bit easier to read because the encode and decode are more obviously parallel, but you don't have to; the two are equivalent.)
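To see that the round trip really is lossless, here is a minimal sketch of the same decode, normalize, encode steps. It is written for Python 3 (an assumption; the question uses Python 2), and the byte string is copied from the output shown in the question. Note that NFKD leaves "ł" (U+0142) alone, because that character has no decomposition.

```python
import unicodedata

# The bytes from the question: "Zmysłowska, Magdalena" encoded as UTF-8.
raw = b"Zmys\xc5\x82owska, Magdalena"

text = raw.decode("utf-8")                        # bytes -> str
normalized = unicodedata.normalize("NFKD", text)  # no change here: U+0142 has no decomposition
encoded = normalized.encode("utf-8")              # str -> bytes

print(ascii(text))       # ASCII-safe view of the decoded string
print(normalized == text)
print(encoded == raw)    # the UTF-8 bytes come back unchanged
```

So the bytes you end up with are the same valid UTF-8 you started with; nothing in these steps is corrupting the data.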

Your actual problem is that you're printing UTF-8 strings to some context that isn't UTF-8. Most likely you're printing to the cmd window, which is defaulting to Windows-1252. So cmd interprets the UTF-8 bytes as Windows-1252 and shows garbage: the two bytes that encode "ł" are displayed as two separate characters.

There's a pretty easy way to test this. Make Python decode the UTF-8 string as if it were Windows-1252 and see if the resulting Unicode string looks like what you're seeing.

>>> print page.decode('windows-1252')
ZmysÅ‚owska, Magdalena

>>> print repr(page.decode('windows-1252'))
u'Zmys\xc5\u201aowska, Magdalena'
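For reference, a rough Python 3 version of the same check (Python 3 is an assumption here; the session above is Python 2). It also shows that the garbling is reversible: encoding the garbled text back to Windows-1252 and decoding it as UTF-8 recovers the original, which is a good sign that only the display is wrong and not the data.

```python
# UTF-8 bytes for "Zmysłowska, Magdalena", as shown in the question.
raw = b"Zmys\xc5\x82owska, Magdalena"

# Misread the UTF-8 bytes as Windows-1252, the way the console does.
mojibake = raw.decode("windows-1252")
print(ascii(mojibake))

# Undo the misreading: back to the original bytes, then decode correctly.
recovered = mojibake.encode("windows-1252").decode("utf-8")
print(ascii(recovered))
```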

There are two ways around this:

Print Unicode strings and let Python take care of it.
Print strings converted to the appropriate encoding.

For option 1:

print page.decode("utf-8") # or unicode(page, "utf-8")

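For option 2, encode to whatever encoding the output context expects, for example one of the following:

print page.decode("utf-8").encode("windows-1252", "replace")
print page.decode("utf-8").encode(sys.stdout.encoding, "replace")

Use sys.stdout.encoding (the encoding Python detected for the console) rather than sys.getdefaultencoding(), which is normally just "ascii" in Python 2. The "replace" handler matters because Windows-1252 has no "ł", so a strict encode would raise UnicodeEncodeError.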
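One caveat with option 2: the target encoding has to be able to represent the character. Windows-1252 has no "ł", while cp1250 (the Central European code page) does. A small Python 3 sketch of what happens in each case; encode_for is a hypothetical helper made up for this illustration, and the "utf-8" fallback for a missing console encoding is likewise an assumption.

```python
import sys

text = "Zmys\u0142owska, Magdalena"

# sys.stdout.encoding can be None when output is redirected;
# falling back to "utf-8" is an assumption of this sketch.
console_encoding = sys.stdout.encoding or "utf-8"


def encode_for(text, encoding):
    """Encode text, substituting '?' for characters the encoding lacks."""
    return text.encode(encoding, errors="replace")


# Windows-1252 cannot represent U+0142, so it is replaced with '?'.
print(encode_for(text, "windows-1252"))

# cp1250 can represent it, so the round trip is lossless.
print(encode_for(text, "cp1250").decode("cp1250") == text)

# What would actually be sent to this console.
print(encode_for(text, console_encoding))
```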

Of course if you keep the intermediate Unicode string around, you don't need all those decode calls:

upage = page.decode("utf-8")
upage = unicodedata.normalize("NFKD", upage)
page = upage.encode("utf-8", "ignore")

print upage
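Since the end goal is to store the name in a database, here is a hedged sketch of the same idea in Python 3: keep the decoded string and hand that to the driver, instead of judging the data by what print shows. The question doesn't say which database is in use, so sqlite3 (in memory) and the people table are stand-ins chosen purely for illustration.

```python
import sqlite3
import unicodedata

# UTF-8 bytes as they arrive from the Excel XML file (copied from the question).
raw = b"Zmys\xc5\x82owska, Magdalena"
name = unicodedata.normalize("NFKD", raw.decode("utf-8"))

# In-memory database and table name are hypothetical stand-ins.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.execute("INSERT INTO people (name) VALUES (?)", (name,))
stored = conn.execute("SELECT name FROM people").fetchone()[0]
conn.close()

print(ascii(stored))   # ASCII-safe view, independent of the console encoding
print(stored == name)  # the value read back matches what was inserted
```

Checking with ascii() or repr() rather than a plain print avoids being misled by the console's encoding.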

solution1
2 ACCPTED 2012-12-17 21:11:25