简体   繁体   English

Python编码/解码问题

[英]Python encoding/decoding problems

How do I decode strings such as this one "weren\\xe2\\x80\\x99t" back to the normal encoding.如何将诸如“weren\\xe2\\x80\\x99t”之类的字符串解码回正常编码。

So this word is actually weren't and not "weren\\xe2\\x80\\x99t"?所以这个词实际上不是不是“weren\\xe2\\x80\\x99t”? For example:例如:

print "\xe2\x80\x9cThings"
string = "\xe2\x80\x9cThings"
print string.decode('utf-8')
print string.encode('ascii', 'ignore')

“Things
“Things
Things

But I actually want to get "Things.但我实际上想得到“东西。

or:或者:

print "weren\xe2\x80\x99t"
string = "weren\xe2\x80\x99t"
print string.decode('utf-8')
print string.encode('ascii', 'ignore')

weren’t
weren’t
werent

But I actually want to get weren't.但我实际上想要得到不是。

How should i do this?我该怎么做?

I mapped the most common strange chars so this is pretty much complete answer based on the Oliver W. answer.我映射了最常见的奇怪字符,因此这是基于 Oliver W. 答案的非常完整的答案。

This function is by no means ideal,but it is the best place to start with.这个功能绝不是理想的,但它是最好的起点。 There are more chars definitions:还有更多字符定义:

http://utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=string http://utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=string
http://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&names=-&utf8=string-literal http://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&names=-&utf8=string-literal

... ...

def unicodetoascii(text):

    uni2ascii = {
            ord('\xe2\x80\x99'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\x9c'.decode('utf-8')): ord('"'),
            ord('\xe2\x80\x9d'.decode('utf-8')): ord('"'),
            ord('\xe2\x80\x9e'.decode('utf-8')): ord('"'),
            ord('\xe2\x80\x9f'.decode('utf-8')): ord('"'),
            ord('\xc3\xa9'.decode('utf-8')): ord('e'),
            ord('\xe2\x80\x9c'.decode('utf-8')): ord('"'),
            ord('\xe2\x80\x93'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x92'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x94'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x94'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x98'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\x9b'.decode('utf-8')): ord("'"),

            ord('\xe2\x80\x90'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x91'.decode('utf-8')): ord('-'),

            ord('\xe2\x80\xb2'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb3'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb4'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb5'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb6'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb7'.decode('utf-8')): ord("'"),

            ord('\xe2\x81\xba'.decode('utf-8')): ord("+"),
            ord('\xe2\x81\xbb'.decode('utf-8')): ord("-"),
            ord('\xe2\x81\xbc'.decode('utf-8')): ord("="),
            ord('\xe2\x81\xbd'.decode('utf-8')): ord("("),
            ord('\xe2\x81\xbe'.decode('utf-8')): ord(")"),

                            }
    return text.decode('utf-8').translate(uni2ascii).encode('ascii')

print unicodetoascii("weren\xe2\x80\x99t")  

You should provide a translation map that maps unicode characters to other unicode characters (the latter should be within the ASCII range if you want to re-encode to it):您应该提供一个将 unicode 字符映射到其他 unicode 字符的转换映射(如果要重新编码,后者应该在 ASCII 范围内):

uni2ascii = {ord('\xe2\x80\x99'.decode('utf-8')): ord("'")}    
yourstring.decode('utf-8').translate(uni2ascii).encode('ascii')
print(yourstring)  # prints: "weren't"

In Python 3 I would do it like this:在 Python 3 中,我会这样做:

string = "\xe2\x80\x9cThings"
bytes_string = bytes(string, encoding="raw_unicode_escape")
happy_result = bytes_string.decode("utf-8", "strict")
print(happy_result)

No translation maps needed, just code :)不需要翻译地图,只需代码:)

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM