Python encoding/decoding problem

Question

How can I decode a string such as "weren\xe2\x80\x99t" back to its normal form?

In other words, the word should actually read weren't and not "weren\xe2\x80\x99t". For example:

print "\xe2\x80\x9cThings"
string = "\xe2\x80\x9cThings"
print string.decode('utf-8')
print string.encode('ascii', 'ignore')

â€œThings
“Things
Things

But what I actually want to get is "Things.

Or:

print "weren\xe2\x80\x99t"
string = "weren\xe2\x80\x99t"
print string.decode('utf-8')
print string.encode('ascii', 'ignore')

werenâ€™t
weren’t
werent

But what I actually want to get is weren't.

How can I do that?

Answer 1

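I have mapped the most common odd characters, so this is a fairly complete answer based on Oliver W.'s answer.

This function is by no means ideal, but it is a good place to start. More character definitions can be found here: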

http://utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=string
http://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&names=-&utf8=string-literal

...

# Python 2: `text` is a byte string (str) holding UTF-8 data.
def unicodetoascii(text):
    # Map each Unicode code point to the code point of its ASCII stand-in.
    uni2ascii = {
        # quotation marks and e with acute accent
        ord('\xe2\x80\x99'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\x9c'.decode('utf-8')): ord('"'),
        ord('\xe2\x80\x9d'.decode('utf-8')): ord('"'),
        ord('\xe2\x80\x9e'.decode('utf-8')): ord('"'),
        ord('\xe2\x80\x9f'.decode('utf-8')): ord('"'),
        ord('\xc3\xa9'.decode('utf-8')): ord('e'),
        ord('\xe2\x80\x98'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\x9b'.decode('utf-8')): ord("'"),

        # dashes and hyphens
        ord('\xe2\x80\x93'.decode('utf-8')): ord('-'),
        ord('\xe2\x80\x92'.decode('utf-8')): ord('-'),
        ord('\xe2\x80\x94'.decode('utf-8')): ord('-'),
        ord('\xe2\x80\x90'.decode('utf-8')): ord('-'),
        ord('\xe2\x80\x91'.decode('utf-8')): ord('-'),

        # primes (U+2032 to U+2037)
        ord('\xe2\x80\xb2'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\xb3'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\xb4'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\xb5'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\xb6'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\xb7'.decode('utf-8')): ord("'"),

        # superscript operators and parentheses (U+207A to U+207E)
        ord('\xe2\x81\xba'.decode('utf-8')): ord("+"),
        ord('\xe2\x81\xbb'.decode('utf-8')): ord("-"),
        ord('\xe2\x81\xbc'.decode('utf-8')): ord("="),
        ord('\xe2\x81\xbd'.decode('utf-8')): ord("("),
        ord('\xe2\x81\xbe'.decode('utf-8')): ord(")"),
    }
    # Raises UnicodeEncodeError if an unmapped non-ASCII character remains.
    return text.decode('utf-8').translate(uni2ascii).encode('ascii')

print unicodetoascii("weren\xe2\x80\x99t")
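For readers on Python 3, where `str` has no `.decode` method, the same mapping idea could be sketched roughly as follows. This is only an illustration, not part of the answer above: the names `PUNCT_TO_ASCII` and `utf8_bytes_to_ascii` are made up here, the table is shortened, and dropping unmapped characters with `"ignore"` is an assumption about what the caller wants.

```python
# Hypothetical Python 3 port of the mapping idea; names are illustrative only.
PUNCT_TO_ASCII = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u00e9": "e",  # e with acute accent
})


def utf8_bytes_to_ascii(data: bytes) -> str:
    """Decode UTF-8 bytes, then swap mapped punctuation for ASCII."""
    text = data.decode("utf-8")
    translated = text.translate(PUNCT_TO_ASCII)
    # Assumption: unmapped non-ASCII characters are silently dropped.
    # Pass "strict" instead of "ignore" to raise on them.
    return translated.encode("ascii", "ignore").decode("ascii")


print(utf8_bytes_to_ascii(b"weren\xe2\x80\x99t"))  # weren't
```

The input is `bytes` here because in Python 3 only bytes can be decoded; a `str` that already holds the right characters can go straight to `.translate()`.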

Answer 2

You should supply a translation map that maps Unicode characters to other Unicode characters (the latter should be in the ASCII range if you are going to re-encode):

uni2ascii = {ord('\xe2\x80\x99'.decode('utf-8')): ord("'")}
result = yourstring.decode('utf-8').translate(uni2ascii).encode('ascii')
print(result)  # prints: weren't
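The remark about keeping the target characters in the ASCII range matters because a strict ASCII encode fails on anything the map does not cover. A minimal Python 3 sketch of that behaviour, assuming a helper named `translate_to_ascii` that is invented here for illustration:

```python
# Illustrative one-entry map: U+2019 (right single quote) -> ASCII apostrophe.
uni2ascii = {ord("\u2019"): ord("'")}


def translate_to_ascii(data: bytes, table: dict) -> str:
    """Decode UTF-8 bytes, apply the map, and insist on pure ASCII."""
    text = data.decode("utf-8").translate(table)
    # A strict encode raises UnicodeEncodeError for anything left unmapped.
    text.encode("ascii")
    return text


print(translate_to_ascii(b"weren\xe2\x80\x99t", uni2ascii))  # weren't

try:
    translate_to_ascii(b"\xe2\x80\x9cThings", uni2ascii)
except UnicodeEncodeError as exc:
    # U+201C is not in the map, so the strict encode fails.
    print("unmapped code point: U+%04X" % ord(exc.object[exc.start]))
```

Seeing which code point triggered the error is one way to decide which entries the map still needs.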

Answer 3

In Python 3, I would do this:

string = "\xe2\x80\x9cThings"
bytes_string = bytes(string, encoding="raw_unicode_escape")
happy_result = bytes_string.decode("utf-8", "strict")
print(happy_result)

No translation map needed, just code :)
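This works because every character in the string has a code point below 256, and for such characters `raw_unicode_escape` writes out one byte with that same value, which is also what `latin-1` does. A hedged sketch of the same round trip, plus a variant for text that contains the escapes as literal backslash sequences (both function names are hypothetical, and the second case is an assumption about how such data might arrive):

```python
def fix_mojibake(text: str) -> str:
    """Re-read a str whose code points are really UTF-8 byte values."""
    # latin-1 maps code points 0-255 to the identical byte values.
    return text.encode("latin-1").decode("utf-8")


def unescape_then_fix(text: str) -> str:
    """Handle text holding literal backslash escapes such as \\xe2."""
    # unicode_escape turns the four characters \xe2 into U+00E2.
    unescaped = text.encode("ascii").decode("unicode_escape")
    return fix_mojibake(unescaped)


# Each comparison checks the repaired text against the intended character.
print(fix_mojibake("\xe2\x80\x9cThings") == "\u201cThings")         # True
print(unescape_then_fix(r"weren\xe2\x80\x99t") == "weren\u2019t")   # True
```

Unlike `raw_unicode_escape`, the `latin-1` encode raises `UnicodeEncodeError` if the text contains any character above U+00FF, which can be a useful signal that the string was not mojibake in the first place.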

3 answers

Answer 1: 10, accepted, 2015-01-18 00:47:13
Answer 2: 1, 2015-01-17 12:58:13
Answer 3: 1, 2021-03-26 10:58:59
