How do I fix broken UTF-8 encoding in Python?

Question

My string is Niá»‡m Bá»“ TÃ¡t (Thiá»n sÆ° Nháº¥t Háº¡nh) and I want to decode it to Niệm Bồ Tát (Thiền sư Nhất Hạnh). I saw that this site can do it: http://www.enderminh.com/minh/utf8-to-unicode-converter.aspx

I started by trying this in Python:

mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
mystr.decode('utf-8')

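But the result is not correct: the original string is UTF-8, yet the string I get back is not what I expected.

Note: the text is Vietnamese.

How can I fix this? Is this Windows Unicode or something else? How can I detect the encoding here?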

Answer 1

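The only thing that helped me with broken Cyrillic strings was ftfy: https://github.com/LuminosoInsight/python-ftfy

This module fixes almost everything and works much better than the online decoders.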

>>> from ftfy import fix_encoding
>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> fix_encoding(mystr)
'09. Bát Nhã Tâm Kinh'

It can be installed easily with pip install ftfy.

Answer 2

I'm not sure what you can do with this data in general, but for the example in your original post, this works:

>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> s = mystr.decode('utf8').encode('latin1').decode('utf8')
>>> s
u'09. B\xe1t Nh\xe3 T\xe2m Kinh'
>>> print(s)
09. Bát Nhã Tâm Kinh
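
The session above is Python 2, where mystr is a byte string. In Python 3, str has no decode method, so a rough equivalent might look like the sketch below. It assumes the damage came from UTF-8 bytes being decoded as Latin-1; the variable names are illustrative, and the broken string is rebuilt from the clean one so the example does not depend on how this page was encoded.

```python
# Assumption: the text was UTF-8 bytes that got decoded as Latin-1.
original = "09. Bát Nhã Tâm Kinh"

# Reproduce the damage: each UTF-8 byte becomes one Latin-1 character.
broken = original.encode("utf-8").decode("latin-1")

# Reverse it: map every character back to the byte it came from,
# then decode those bytes as UTF-8.
fixed = broken.encode("latin-1").decode("utf-8")

print(fixed)
```

If the string was never damaged this way, the encode or decode step will usually raise a UnicodeError rather than return a wrong result.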

Answer 3

Try:

str.encode('ascii', 'ignore').decode('utf-8')

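This encodes the string as ASCII while ignoring errors, then decodes the result as UTF-8. It removes the accented characters instead of restoring them, but it is one approach.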
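
To see what that does to the sample string, here is a small Python 3 check. The names broken and stripped are illustrative, and the broken string is rebuilt from the clean text (assuming a Latin-1 mis-decode) rather than pasted in.

```python
# Assumption: the mojibake came from UTF-8 bytes decoded as Latin-1.
broken = "09. Bát Nhã Tâm Kinh".encode("utf-8").decode("latin-1")

# errors="ignore" silently drops every character outside ASCII,
# so both halves of each mis-decoded pair disappear.
stripped = broken.encode("ascii", "ignore").decode("utf-8")

print(stripped)  # 09. Bt Nh Tm Kinh
```

The vowels are lost along with their accents, so this is only suitable when an ASCII-only approximation is acceptable.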

Answer 4

The correct way in Python 3.9.6 is:

"string".encode('utf-8').decode('latin-1')

"string".encode('latin1').decode('utf8')

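Of the two lines above, the first one produces this kind of mojibake and the second one reverses it. The longer string in the question contains characters such as ‡ and “, which exist in Windows-1252 but not in Latin-1, so a plain Latin-1 round trip would fail on it. A minimal sketch that tries both codecs follows; repair_mojibake is a hypothetical helper name, not part of any library, and it only handles a single round of mis-decoding.

```python
def repair_mojibake(text: str) -> str:
    """Try to undo UTF-8 text that was decoded with a single-byte codec.

    Assumption: the damage is exactly one round of decoding UTF-8 bytes
    as Windows-1252 or Latin-1. Text that does not fit is returned as-is.
    """
    for codec in ("cp1252", "latin-1"):
        try:
            # Recover the original bytes, then decode them as UTF-8.
            return text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Wrong codec for this text, or it was not mojibake at all.
            continue
    return text


# Usage: rebuild a damaged string, then repair it.
damaged = "Niệm Bồ Tát".encode("utf-8").decode("cp1252")
print(repair_mojibake(damaged))
```

Trying cp1252 first is a judgment call: it covers the typographic characters Windows tends to introduce, while the Latin-1 fallback covers bytes that Windows-1252 leaves undefined.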

4 answers

Answer 1: 12 votes, 2016-10-06 19:42:29
Answer 2: 10 votes, accepted, 2014-10-21 17:27:17
Answer 3: 0 votes, 2019-10-15 02:34:46
Answer 4: 0 votes, 2022-08-17 21:46:11
