
Is Django double encoding a Unicode (utf-8?) string?

I'm having trouble storing and outputting an ndash character as UTF-8 in Django.

I'm getting data from an API. In raw form, as retrieved and viewed in a text editor, a given unit of data may look like:

"I love this detergent \u2013 it is so inspiring." 

(\u2013 is &ndash; as an HTML entity).

If I get this straight from the API and display it in Django, no problem: it displays in my browser as a long dash. I noticed, though, that I have to call decode('utf-8') to avoid the "'ascii' codec can't encode character" error if I try to do some operations with that text in my view. According to the Django Debug Toolbar, the text is going to the template as "I love this detergent \u2013 it is so inspiring.".

When stored to MySQL and read for output through the same view and template, however, it ends up looking like

"I love this detergent â€“ it is so inspiring"

My MySQL table is set to DEFAULT CHARSET=utf8.

Now, when I read the data from the database through the MySQL monitor in a terminal set to UTF-8, it shows up as

"I love this detergent – it is so inspiring" 

(correct - shows an ndash)

When I read the same row with MySQLdb in a Python shell, the value is

"I love this detergent \xe2\x80\x93 it is so inspiring" 

(this is the correct UTF-8 for an ndash)

However, if I run python manage.py shell, and then

In [1]: from myproject.myapp.models import ThatTable
In [2]: msg=ThatTable.objects.all().filter(thefield__contains='detergent')
In [3]: msg
Out[3]: [{'thefield': 'I love this detergent \xc3\xa2\xe2\x82\xac\xe2\x80\x9c it is so inspiring'}]

It appears to me that Django has taken \xe2\x80\x93 to mean three separate characters, and encoded each of them as UTF-8, giving \xc3\xa2\xe2\x82\xac\xe2\x80\x9c. This displays as â€“ because \xe2 appears to be â, \x80 appears to be €, etc. I've checked and this is how it's being sent to the template, as well.

If you decode the long sequence in Python, though, with decode('utf-8'), the result is \xe2\u20ac\u201c, which also renders in the browser as â€“. Trying to decode it again yields a UnicodeDecodeError.
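The byte sequences quoted above can be reproduced in plain Python, which supports the double-encoding reading. This is a minimal sketch, assuming the wrong intermediate charset is cp1252 (the Windows-1252 table that MySQL uses under the name latin1); the variable names are illustrative only.

```python
# Sketch: reproduce the double-encoded bytes starting from one en dash.
# Assumption: the bytes were misread as cp1252 (MySQL's "latin1").

ndash = u'\u2013'

# Step 1: the correct UTF-8 encoding, three bytes.
utf8_bytes = ndash.encode('utf-8')

# Step 2: a latin1 connection treats each byte as its own character.
misread = utf8_bytes.decode('cp1252')

# Step 3: those three characters are encoded as UTF-8 again for storage.
double_encoded = misread.encode('utf-8')

print(repr(utf8_bytes))
print(repr(misread))
print(repr(double_encoded))
```

The three values correspond to the correct bytes seen through MySQLdb, the result of the single decode('utf-8') described above, and the long sequence returned in the Django shell.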

I've followed the Django suggestions for Unicode, as far as I know (configured MySQL).

Any suggestions on what I may have misconfigured?

Addendum: It seems this same issue has cropped up in other areas or systems as well. While searching for \xc3\xa2\xe2\x82\xac\xe2\x80\x9c, I found at http://pastie.org/908443.txt a script to 'repair bad UTF8 entities', which is also found in a WordPress RSS import plugin. It simply replaces this sequence with –. I'd like to solve this the right way, though!
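For rows that are already stored double-encoded, a more general repair than a fixed string replacement is to reverse the faulty round trip. The function below is a hypothetical helper (repair_double_utf8 is not part of Django or MySQLdb) and assumes exactly one extra UTF-8 over cp1252 round trip; it is a sketch, not a substitute for fixing the connection charset.

```python
def repair_double_utf8(raw):
    """Undo one round of UTF-8 -> cp1252 -> UTF-8 double encoding.

    raw is a byte string as read from the database; the return value
    is decoded text. Input that does not look double-encoded is
    decoded as ordinary UTF-8.
    """
    try:
        # Decode the outer layer, map each character back to the single
        # byte it was misread from, then decode the original UTF-8.
        return raw.decode('utf-8').encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not reversible this way, so treat it as correctly encoded.
        return raw.decode('utf-8')


broken = b'I love this detergent \xc3\xa2\xe2\x82\xac\xe2\x80\x9c it is so inspiring'
print(repr(repair_double_utf8(broken)))
```

This heuristic can misfire on text that legitimately contains sequences such as "Ã©", and Python's cp1252 codec rejects a few byte values that MySQL's latin1 accepts, so it is worth checking the output on a sample before running it over a whole table.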

Oh, and I'm using Django 1.2 and Python 2.6.5.

I can connect to the same database with PHP/PDO and print out this data without doing anything special, and it looks fine.

This does seem like a case of double-encoding; I don't have much experience with Python, but try adjusting the MySQL connection settings as per the advice at http://tahpot.blogspot.com/2005/06/mysql-and-python-and-unicode.html

What I'm guessing is happening is that the connection is latin1, so MySQL tries to encode the string again before storage to the UTF-8 field. The code there, specifically this bit:

EDIT: With Python when establishing a database connection add the following flag: init_command='SET NAMES utf8'.

In addition set the following in MySQL's my.cnf: default-character-set = utf8

is probably what you want.
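On the Django side, the same flag can be passed through the database OPTIONS in settings.py, which Django hands to MySQLdb when it connects. This is a sketch only: the database name and credentials are placeholders, and Django's MySQL backend normally requests utf8 by itself, so spelling it out mainly makes the intent explicit.

```python
# settings.py fragment (Django 1.2-style DATABASES dict).
# NAME, USER and PASSWORD are placeholder values, not from the question.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'mydatabase',
        'USER': 'myuser',
        'PASSWORD': 'mypassword',
        'OPTIONS': {
            # Forwarded to MySQLdb.connect(): connection charset, plus
            # the statement to run right after connecting.
            'charset': 'utf8',
            'init_command': 'SET NAMES utf8',
        },
    }
}
```

Any other client that writes to the same tables needs the equivalent connection setting, otherwise its bytes are still interpreted as latin1 on the way in.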

I added SET NAMES utf8 to my PHP data insertion sequence, and now in a Python shell the feared ndash shows up as \x96. This renders correctly when read and output through Django.

One unusual aspect of my situation is that I'm inserting the data through PHP. Django issues SET NAMES utf8 automatically, so if I had been both inserting and reading the data through Django, this issue would likely not have appeared. PHP was using the default of latin1, I suppose.

As an interesting note, while before I could read the data from PHP and it showed up normally in the browser, now the ndash is garbled unless I call SET NAMES utf8 before reading the data.
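The \x96 and the broken dash on a read without SET NAMES are both consistent with a correctly stored en dash being converted to latin1 on the way out. A small sketch, again assuming MySQL's latin1 behaves like cp1252 and that the browser interprets the page as UTF-8:

```python
stored = u'\u2013'  # the en dash, now stored correctly

# What a latin1 connection hands back: a single cp1252 byte.
as_latin1_bytes = stored.encode('cp1252')

# What a utf8 connection hands back: the three-byte UTF-8 form.
as_utf8_bytes = stored.encode('utf-8')

# A UTF-8 page cannot interpret the lone latin1 byte, so the browser
# substitutes the replacement character U+FFFD.
shown_latin1 = as_latin1_bytes.decode('utf-8', 'replace')
shown_utf8 = as_utf8_bytes.decode('utf-8', 'replace')

print(repr(as_latin1_bytes))
print(repr(shown_latin1))
print(repr(shown_utf8))
```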

So, it's working now and I hope I never have to understand whatever was going on before!
