
python, UnicodeEncodeError, converting unicode to ascii

First, I am pretty new to Python, so forgive the n00b questions. The application logic goes like this:

  1. I send an SQL SELECT to the database and it returns an array of data.
  2. I need to take this data and use it in another SQL INSERT statement.

The problem is that the SQL query returns unicode strings. The output from the SELECT looks something like this:

(u'Abc', u'Lololo', u'Fjordk\xe6r')

So first I tried to convert each element to a string, but it fails because the third element contains the non-ASCII letter 'æ':

str_data = []
for x in data[0]:
    # In Python 2, str() encodes with the ASCII codec, so u'\xe6' raises here
    str_data.append(str(x))

I am getting: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 6: ordinal not in range(128)

I can't pass the unicode values directly to the INSERT either, because a TypeError occurs: TypeError: coercing to Unicode: need string or buffer, NoneType found

Any ideas?

In my experience, Python and Unicode are often a problem.

Generally speaking, if you have a Unicode string, you can convert it to a normal string like this:

normal_string = unicode_string.encode('utf-8')

And convert a normal string to a Unicode string like this:

unicode_string = normal_string.decode('utf-8')
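As a quick sanity check, here is a minimal sketch of that round trip using the value from the question. The variable names are just for illustration; the point is that encoding yields bytes (æ takes two bytes in UTF-8) and decoding with the same codec gives back the original text.

```python
# Round trip: unicode text -> UTF-8 bytes -> unicode text.
unicode_string = u'Fjordk\xe6r'

# encode() turns text into bytes; the single character u'\xe6' becomes
# the two bytes \xc3\xa6 in UTF-8.
utf8_bytes = unicode_string.encode('utf-8')

# decode() with the same codec restores the original text.
round_tripped = utf8_bytes.decode('utf-8')

print(len(unicode_string), len(utf8_bytes))
print(round_tripped == unicode_string)
```

The lengths differ because one is counted in characters and the other in bytes, which is usually the first hint that you are holding an encoded string rather than text.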

The issue here is that the str function tries to encode the unicode string using the ASCII codec, and ASCII has no mapping for u'\xe6' (æ).

Therefore you need to encode it with a codec that supports the character. Nowadays the most common choice is UTF-8.

>>> x = (u'Abc', u'Lololo', u'Fjordk\xe6r')
>>> print x[2].encode("utf8")
Fjordkær
>>> x[2].encode("utf-8")
'Fjordk\xc3\xa6r'
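Applied to the loop from the question, this could look roughly like the sketch below. `encode_row` is a hypothetical helper, not something from the asker's code, and passing `None` through unchanged is an assumption based on the NoneType mentioned in the TypeError.

```python
def encode_row(row, encoding='utf-8'):
    """Encode every text value in a row; leave None (SQL NULL) untouched.

    Hypothetical helper for illustration only.
    """
    encoded = []
    for value in row:
        if value is None:
            # Assumption: NULL columns come back as None and should stay None.
            encoded.append(None)
        else:
            encoded.append(value.encode(encoding))
    return encoded


row = (u'Abc', u'Lololo', u'Fjordk\xe6r')
str_data = encode_row(row)
print(str_data)
```

Depending on the database driver, it may be simpler to skip the manual encoding and hand the unicode values to a parameterized query, but that depends on the driver in use.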

On the other hand, you may try to convert it to cp1252, the Western European (Latin) code page, which supports it:

>>> x[2].encode("cp1252")
'Fjordk\xe6r'

But the Eastern European code page cp1250 doesn't support it:

>>> x[2].encode("cp1250")
...
UnicodeEncodeError: 'charmap' codec can't encode character u'\xe6' in position 6: character maps to <undefined>
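If you are stuck with a code page that lacks the character, the second argument of `encode` selects an error handler. The sketch below shows three of the standard handlers on the same string; whether losing or substituting the character is acceptable is for you to decide.

```python
text = u'Fjordk\xe6r'

# The default handler is 'strict', which raises on unmappable characters.
failed_at = None
try:
    text.encode('cp1250')
except UnicodeEncodeError as exc:
    failed_at = exc.start  # index of the first character that could not be encoded

# 'replace' substitutes a question mark for the unmappable character.
replaced = text.encode('cp1250', 'replace')

# 'ignore' silently drops it.
ignored = text.encode('cp1250', 'ignore')

# 'xmlcharrefreplace' writes a numeric character reference instead.
as_charref = text.encode('cp1250', 'xmlcharrefreplace')

print(failed_at, replaced, ignored, as_charref)
```

All three handlers lose information in some way, so encoding to UTF-8 is still the safer default when the target can accept it.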

Trouble with unicode in Python is very common, and I would suggest the following:

  • understand what Unicode is
  • understand what UTF-8 is (it is an encoding of Unicode, not Unicode itself)
  • understand ASCII and other code pages
  • recommended conversion workflow: input (any code page) -> convert to unicode -> (process) -> output as UTF-8
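The workflow in the last bullet can be sketched as follows. `transcode_to_utf8` is a hypothetical name, the input is assumed to arrive as cp1252 bytes, and upper-casing stands in for whatever processing you actually need.

```python
def transcode_to_utf8(raw_bytes, source_encoding):
    """Decode at the input boundary, work on text, encode at the output boundary.

    Hypothetical helper; the caller must know the source encoding.
    """
    # 1. Input: bytes in some code page -> unicode text.
    text = raw_bytes.decode(source_encoding)

    # 2. Process: operate on text only (upper-casing is just a placeholder).
    text = text.upper()

    # 3. Output: unicode text -> UTF-8 bytes.
    return text.encode('utf-8')


cp1252_input = b'Fjordk\xe6r'  # assumed to be cp1252-encoded
utf8_output = transcode_to_utf8(cp1252_input, 'cp1252')
print(repr(utf8_output))
```

Keeping the decode and encode steps at the edges means the code in the middle never has to guess which code page a byte string is in.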
