How do I decode this UTF-8 string, picked up on a random website and saved by the Django ORM, using Python?

Question

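I parsed a file and saved its contents in the database using Django. The website was 100% English, so I naively assumed it was ASCII all the way and happily saved the text as unicode.

You can guess the rest of the story :-)

When I print it, I get the usual encoding error: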

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 48: ordinal not in range(128)

A quick search tells me that u'\u2019' is the UTF-8 representation of ’.

repr(string) shows me:

"u'his son\\u2019s friend'"

Then of course I tried django.utils.encoding.smart_str and, more directly, string.encode('utf-8'), and I finally got something printable. Unfortunately, this is how it prints in my (Linux, UTF-8) terminal:

In [76]: repr(string.encode('utf-8'))
Out[76]: "'his son\\xe2\\x80\\x99s friend '"

In [77]: print string.encode('utf-8')
his son�s friend

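Not what I expected. I suspect I double-encoded something or missed an important point.

Of course, the file's original encoding was not declared along with the file. I suppose I could read the HTTP headers or ask the webmaster, but since \u2019s looks like UTF-8, I assumed it was utf-8. I may be very wrong; tell me if I am.

A solution is obviously welcome, but a deep explanation of the cause, and of how to avoid this happening again, would be even more so. I often get bitten by encodings, which shows that I still don't fully grasp the subject.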
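Reading the declared charset from the HTTP headers, as mentioned above, could look roughly like the sketch below. It is written for Python 3 (the sessions in this thread are Python 2), and the helper names `charset_from_content_type` and `decode_body`, the UTF-8 fallback, and the sample header value are all assumptions made for illustration, not something taken from the site in question.

```python
from email.message import Message


def charset_from_content_type(content_type, fallback="utf-8"):
    """Return the charset named in a Content-Type header value.

    Falls back to `fallback` (an assumption) when the header names none.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or fallback


def decode_body(raw_bytes, content_type):
    """Decode raw response bytes using the charset the server declared."""
    return raw_bytes.decode(charset_from_content_type(content_type))


# Hypothetical response: the bytes are the UTF-8 form of the question's string.
body = b"his son\xe2\x80\x99s friend"
text = decode_body(body, "text/html; charset=UTF-8")
```

The idea is to decode once, at the boundary where bytes arrive, and to hand only text to the ORM afterwards.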

Answer 1

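You are fine. You have the proper data. Yes, the original data was UTF-8 (based on context, u2019 makes perfect sense as an apostrophe between "son" and "s"). The weird ? error character probably just means that your terminal's configured font has no glyph for this character (a fancy apostrophe). No big deal. The data is correct where it counts. If you are nervous, try some different terminal/OS combinations (I was using iTerm on OS X). I spent a lot of time explaining to my QA folks that the scary ? question mark character just means they don't have Chinese fonts installed on their Windows box (in my case we were testing with Chinese data). Here's some commentary: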

#Create a Python Unicode object
#(abstract code points, independent of any encoding)
#single backslash tells python we want to represent
#a code point by its unicode code point number, typed out with ASCII numbers
>>> s1 = u'his son\u2019s friend'

#If you just type it at the prompt,
#the interpreter does the equivalent of `print repr(s1)`
#and since repr means "show it like a string typed into a python source file",
#you get your ASCII escaped version back
>>> s1
u'his son\u2019s friend'
>>> print repr(s1)
u'his son\u2019s friend'

#This isn't ASCII, so encoding into ASCII generates your original
#error as expected
>>> s1.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)

# Encode in UTF-8 and now we have a string,
# which gets displayed as hex escapes.     
#Unicode code point 2019 looks like it gets 3 bytes in UTF-8 (yup, it does)
>>> s1.encode('utf-8')
'his son\xe2\x80\x99s friend'

#My terminal DOES have a different glyph (symbol) to use here,
#so it displays OK for me.
#Note that my terminal has a different glyph for a normal ASCII apostrophe
#(straight vertical)
>>> print s1
his son’s friend
>>> repr(s1)
"u'his son\\u2019s friend'"
>>> str(s1.encode('utf-8'))
'his son\xe2\x80\x99s friend'
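For readers on Python 3, the same distinction between text (code points) and bytes (one particular encoding of them) can be sketched as follows. The last lines show what an actual double encoding would look like, assuming the UTF-8 bytes were mistakenly decoded as cp1252 somewhere along the way; that mistake is an illustration, not something observed in the question's data.

```python
# Text: a sequence of code points, independent of any encoding.
s1 = "his son\u2019s friend"

# Bytes: U+2019 becomes the three bytes e2 80 99 in UTF-8.
utf8_bytes = s1.encode("utf-8")

# Decoding with the same codec gives the original text back.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with the wrong codec yields mojibake, and encoding that
# mojibake again is what "double encoding" means in practice.
mojibake = utf8_bytes.decode("cp1252")
double_encoded = mojibake.encode("utf-8")
```

If the stored value looks like `mojibake` rather than `s1`, the data really was damaged; if it looks like `s1`, only the display is at fault.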

See also: http://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html

See also character 2019 (e28099 in hex; search for "2019" on this page): http://www.utf8-chartable.de/unicode-utf8-table.pl?start=8000

See also: http://www.joelonsoftware.com/articles/Unicode.html

Answer 2

Maybe I'm being naive, but... isn't your problem just that the leading \ of the Unicode code point got escaped?

Your original string behaves like this:

>>> s = u'his son\\u2019s friend'
>>> print(s)
his son\u2019s friend

But removing the escaping \ gives:

>>> s = u'his son\u2019s friend'
>>> print(s)
his son’s friend
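If the stored text really did contain a literal backslash followed by u2019, one way to recover the intended character is the `unicode_escape` codec. This is a minimal Python 3 sketch that assumes the rest of the string is plain ASCII; with non-ASCII input the `encode("ascii")` step raises instead of silently corrupting the text.

```python
# A string holding a literal backslash, 'u', '2', '0', '1', '9'.
escaped = "his son\\u2019s friend"

# Interpret the \uXXXX escape sequences and get the real character back.
fixed = escaped.encode("ascii").decode("unicode_escape")
```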

Answer 3

Try invoking the Python shell like this:

python2 -S -i -c 'import sys;sys.setdefaultencoding("utf-8");import site'

Then:

>>> s = u'his son\u2019s friend'
>>> print s.encode("utf-8")
his son’s friend

The default encoding is then utf-8, and it should print fine.
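A less invasive way to look at the same problem is to ask the output stream which encoding it expects and to degrade gracefully when a character does not fit. The sketch below assumes Python 3, and the helper name `printable` is made up for illustration.

```python
import sys


def printable(text, encoding=None):
    """Return `text` with characters the target encoding lacks replaced by '?'.

    Uses the encoding of sys.stdout when none is given, and falls back to
    ASCII if the stream does not report one.
    """
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)


s = "his son\u2019s friend"
print(printable(s))
```

On a stream that reports ASCII this prints a plain `?` in place of the apostrophe rather than raising UnicodeEncodeError. Setting the `PYTHONIOENCODING` environment variable before starting the interpreter is another way to control the encoding used for standard output.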
