
Is Django double encoding a Unicode (utf-8?) string?

I'm having trouble storing and outputting an ndash character as UTF-8 in Django.

I'm getting data from an API. In raw form, as retrieved and viewed in a text editor, a given unit of data may be similar to:

"I love this detergent \u2013 it is so inspiring." 

(\u2013 is &ndash; as an HTML entity).

If I get this straight from an API and display it in Django, no problem. It displays in my browser as a long dash. I noticed I have to do decode('utf-8') to avoid the "'ascii' codec can't encode character" error if I try to do some operations with that text in my view, though. The text is going to the template as "I love this detergent\u2013 it is so inspiring.", according to the Django Debug Toolbar.

When stored to MySQL and read for output through the same view and template, however, it ends up looking like

"I love this detergent â€“ it is so inspiring"

My MySQL table is set to DEFAULT CHARSET=utf8.

Now, when I read the data from the database through the MySQL monitor in a terminal set to UTF-8, it shows up as

"I love this detergent – it is so inspiring" 

(correct: it shows an ndash)

When I use MySQLdb in a Python shell, this line is

"I love this detergent \xe2\x80\x93 it is so inspiring" 

(this is the correct UTF-8 for an ndash)
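A quick way to confirm that byte sequence is to encode the character yourself. This is only a sketch in Python 3 syntax (the question uses Python 2.6, where byte strings and unicode strings are spelled differently), and the variable names are made up for illustration:

```python
# U+2013 EN DASH, written as an escape so the encoding of this file does not matter.
en_dash = "\u2013"

# Encode the single character as UTF-8 and inspect the resulting bytes.
utf8_bytes = en_dash.encode("utf-8")

print(utf8_bytes)       # b'\xe2\x80\x93'
print(len(utf8_bytes))  # 3
```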

However, if I run python manage.py shell, and then

In [1]: from myproject.myapp.models import ThatTable
In [2]: msg=ThatTable.objects.all().filter(thefield__contains='detergent')
In [3]: msg
Out[3]: [{'thefield': 'I love this detergent \xc3\xa2\xe2\x82\xac\xe2\x80\x9c it is so inspiring'}]

It appears to me that Django has taken \xe2\x80\x93 to mean three separate characters, and encoded it as UTF-8 into \xc3\xa2\xe2\x82\xac\xe2\x80\x9c. This displays as â€“ because \xe2 appears to be â, \x80 appears to be €, etc. I've checked and this is how it's being sent to the template, as well.
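The byte sequence above can be reproduced by hand, which supports the double-encoding theory. The sketch below uses Python 3 syntax and assumes the single-byte misreading is Windows-1252 (cp1252), chosen because that code page maps 0x80 to the euro sign as observed; the variable names are illustrative only:

```python
correct = b"\xe2\x80\x93"  # UTF-8 bytes for U+2013 (en dash)

# Step 1: misread the three bytes as three single-byte characters.
# cp1252 is an assumption; it is the code page where 0x80 is the euro sign.
misread = correct.decode("cp1252")

# Step 2: encode those three characters as UTF-8 a second time.
double_encoded = misread.encode("utf-8")

print([hex(ord(c)) for c in misread])  # ['0xe2', '0x20ac', '0x201c']
print(double_encoded)                  # b'\xc3\xa2\xe2\x82\xac\xe2\x80\x9c'
```

The second line of output matches the bytes seen in the Django shell exactly.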

If you decode the long sequence in Python, though, with decode('utf-8'), the result is \xe2\u20ac\u201c, which also renders in the browser as â€“. Trying to decode it again yields a UnicodeDecodeError.

I've followed the Django suggestions for Unicode, as far as I know (configured MySQL).

Any suggestions on what I may have misconfigured?

Addendum: It seems this same issue has cropped up in other areas or systems as well. While searching for \xc3\xa2\xe2\x82\xac\xe2\x80\x9c, I found at http://pastie.org/908443.txt a script to 'repair bad UTF8 entities', also found in a WordPress RSS import plugin. It simply replaces this sequence with –. I'd like to solve this the right way, though!
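For rows that are already damaged, a more general stopgap than replacing one hard-coded sequence is to reverse the two steps. This is a sketch in Python 3 syntax, not the script from the link above: undo_double_encoding is a hypothetical helper, and it assumes the misreading was cp1252. It repairs existing data only; it does not address the misconfiguration that causes the damage.

```python
def undo_double_encoding(raw: bytes) -> str:
    """Hypothetical helper: reverse one round of
    'UTF-8 bytes read as cp1252, then re-encoded as UTF-8'."""
    text = raw.decode("utf-8")
    try:
        # Map each character back to its cp1252 byte, then decode as UTF-8.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not double-encoded in the assumed way; leave it alone.
        return text


damaged = b"I love this detergent \xc3\xa2\xe2\x82\xac\xe2\x80\x9c it is so inspiring"
repaired = undo_double_encoding(damaged)
print(repaired == "I love this detergent \u2013 it is so inspiring")  # True
```

Text that is already correct falls through the except branch and is returned unchanged, so the helper is safe to run over a mix of good and bad rows under the stated assumption.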

Oh, and I'm using Django 1.2 and Python 2.6.5.

I can connect to the same database with PHP/PDO and print out this data without doing anything special, and it looks fine.

This does seem like a case of double-encoding. I don't have much experience with Python, but try adjusting the MySQL connection settings as per the advice at http://tahpot.blogspot.com/2005/06/mysql-and-python-and-unicode.html

My guess is that the connection is latin1, so MySQL tries to encode the string again before storing it in the UTF-8 field. The code there, specifically this bit:

EDIT: With Python, when establishing a database connection, add the following flag: init_command='SET NAMES utf8'.

In addition, set the following in MySQL's my.cnf: default-character-set = utf8

is probably what you want.
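In a Django project, the same connection options would go in the database settings rather than in a direct connect call. The fragment below is a sketch only: the database name and credentials are placeholders, and it assumes the MySQLdb driver, whose connect() accepts the charset and init_command keyword arguments that Django forwards from OPTIONS.

```python
# settings.py fragment. NAME, USER and PASSWORD are placeholder values.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "myproject",
        "USER": "myuser",
        "PASSWORD": "change-me",
        "OPTIONS": {
            # Forwarded to the driver's connect() call.
            "charset": "utf8",
            "init_command": "SET NAMES utf8",
        },
    }
}
```

Any other client that writes to the same tables needs the equivalent setting on its own connection, since the connection character set is negotiated per client.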

I added set names utf8 to my PHP data insertion sequence, and now in a Python shell the feared ndash shows up as \x96. This renders correctly when read and output through Django.

One unusual aspect of my situation is that I'm inserting data through PHP. Django issues set names utf8 automatically, so if I had been inserting and reading the data through Django, this issue would likely not have appeared. PHP was using the default of latin1, I suppose.

As an interesting note, while before I could read the data from PHP and it showed up normally in the browser, now the ndash does not display correctly unless I call set names before reading the data.
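One plausible explanation for both observations (the \x96 in the Python shell and the PHP output breaking after the fix) is that a latin1 connection makes the server convert column values to single bytes on the way out. The simulation below is a sketch in Python 3 syntax with a hypothetical function name; it assumes MySQL's latin1 behaves like cp1252 for these characters and that the PHP page is served as UTF-8.

```python
def read_via_latin1_connection(stored_text: str) -> bytes:
    """Simulate the server converting a utf8 column value for a latin1 client.
    Assumption: MySQL's latin1 matches cp1252 for the characters used here."""
    return stored_text.encode("cp1252")


# Before the fix the column held three characters: U+00E2, U+20AC, U+201C.
before = read_via_latin1_connection("\u00e2\u20ac\u201c")
# After the fix the column holds one real en dash, U+2013.
after = read_via_latin1_connection("\u2013")

print(before)  # b'\xe2\x80\x93', valid UTF-8, so a UTF-8 page shows a dash
print(after)   # b'\x96', a lone byte that is not valid UTF-8

# What a UTF-8 page would make of each result:
print(before.decode("utf-8") == "\u2013")                       # True
print(after.decode("utf-8", errors="replace") == "\ufffd")      # True
```

Under these assumptions, the old PHP read path only looked right because the conversion on the way out cancelled the damage done on the way in.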

So, it's working now and I hope I never have to understand whatever was going on before!
