python ascii codes to utf

So when I post a name or text in mod_python in my native language I get:

&#1084;&#1072;&#1082;&#1077;&#1076;&#1086;&#1085;&#1080;&#1112;&#1072;

And I also get:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)

When I use:

hparser = HTMLParser.HTMLParser()
req.write(hparser.unescape(text))

How can I decode it?

It's hard to explain UnicodeErrors if you don't understand the underlying mechanism. You should really read either or both of

In a (very small) nutshell, a Unicode code point is an abstract "thingy" representing one character [1]. Programmers like to work with these, because we like to think of strings as coming one character at a time. Unfortunately, it was decreed a long time ago that a character must fit in one byte of memory, so there can be at most 256 different characters. Which is fine for plain English, but doesn't work for anything else. There's a global list of code points (with room for over a million of them) which is meant to hold every possible character, but clearly they don't fit in a byte.

The solution: there is a difference between the ordered list of code points that make up a string, and its encoding as a sequence of bytes. Whenever you work with a string, you have to be clear about which of these forms it should be in.
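To make the two forms concrete, here is a small sketch. It assumes Python 3 syntax (the session further down uses Python 2.7), where `str` holds code points and `bytes` holds the encoded form. The variable names are only for illustration.

```python
# Python 3: a str is a sequence of code points, bytes is a sequence of bytes.
word = "\u043c\u0430\u043a"  # the first three letters of the word in the question

# Form 1: the ordered list of code points that make up the string.
code_points = [ord(ch) for ch in word]

# Form 2: one possible encoding of those code points as bytes (UTF-8 here).
utf8_bytes = list(word.encode("utf-8"))

print(code_points)  # [1084, 1072, 1082]
print(utf8_bytes)   # [208, 188, 208, 176, 208, 186]
```

Three code points become six bytes in UTF-8, which is why the two forms can't be treated as interchangeable.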

To convert between the forms you can .encode() a list of code points (a Unicode string) as a list of bytes, and .decode() bytes into a list of code points. To do so, you need to know how to map code points into bytes and vice versa, which is the encoding. If you don't specify one, Python 2.x will guess that you meant ASCII. If that guess is wrong, you will get a UnicodeError.
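A minimal round-trip sketch of that conversion, again assuming Python 3 and naming the encoding explicitly instead of relying on any default:

```python
# Python 3 sketch: encode code points to bytes and decode them back.
text = "\u043c\u0430\u043a\u0435\u0434\u043e\u043d\u0438\u0458\u0430"

encoded = text.encode("utf-8")     # code points -> bytes
decoded = encoded.decode("utf-8")  # bytes -> code points
assert decoded == text             # the round trip is lossless

# Decoding with the wrong encoding fails: these bytes are not valid ASCII.
try:
    encoded.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(len(text), len(encoded), ascii_failed)  # 10 20 True
```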

Note that Python 3.x is much better at handling Unicode strings, because the distinction between bytes and code points is much more clear-cut.
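As a rough illustration of that clearer distinction (Python 3 only; the names are made up for the example), mixing the two types fails immediately instead of triggering a hidden ASCII conversion:

```python
# Python 3 keeps code points (str) and encoded bytes (bytes) as separate types.
text = "\u043c\u0430\u043a"   # str: code points
data = text.encode("utf-8")   # bytes: encoded form

try:
    combined = text + data    # no implicit conversion is attempted
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(type(text).__name__, type(data).__name__, mixed_ok)  # str bytes False
```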

[1] Sort of.


EDIT: I guess I should point out how this helps. But you really should read the above links! Just throwing in .encode()s and .decode()s everywhere is a terrible way to code, and one day you'll get bitten by a worse problem.

Anyway, if you step through what you're doing in the shell you'll see

>>> from HTMLParser import HTMLParser
>>> text = "&#1084;&#1072;&#1082;&#1077;&#1076;&#1086;&#1085;&#1080;&#1112;&#1072;"
>>> hparser = HTMLParser()
>>> text = hparser.unescape(text)
>>> text
u'\u043c\u0430\u043a\u0435\u0434\u043e\u043d\u0438\u0458\u0430'

I'm using Python 2.7 here, so that's a Unicode string, i.e. a sequence of Unicode code points. We can encode them into a regular string (i.e. a list of bytes) like

>>> text.encode("utf-8")
'\xd0\xbc\xd0\xb0\xd0\xba\xd0\xb5\xd0\xb4\xd0\xbe\xd0\xbd\xd0\xb8\xd1\x98\xd0\xb0'

But we could also pick a different encoding!

>>> text.encode("utf-16")
'\xff\xfe<\x040\x04:\x045\x044\x04>\x04=\x048\x04X\x040\x04'

You'll need to decide what encoding you want to use.

What went wrong when you did it? Well, not every encoding understands every code point. In particular, the "ascii" encoding only understands the first 128! So if you try

>>> text.encode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)

you just get an error, because you can't encode those code points in ASCII.
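If the output really must be ASCII, the codec's error handlers are one way out. The sketch below assumes Python 3 and uses the standard `errors` argument of `str.encode`; whether either fallback is acceptable depends on where the bytes end up.

```python
text = "\u043c\u0430\u043a"

# Strict mode (the default) raises, as in the traceback above.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "xmlcharrefreplace" writes each unencodable code point as a numeric
# character reference, which is still plain ASCII.
as_refs = text.encode("ascii", errors="xmlcharrefreplace")

# "replace" substitutes a question mark, which loses the original text.
as_marks = text.encode("ascii", errors="replace")

print(strict_failed)  # True
print(as_refs)        # b'&#1084;&#1072;&#1082;'
print(as_marks)       # b'???'
```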

When you do req.write, you are trying to write a list of code points into the response. But HTTP doesn't understand code points: it just carries bytes. Python 2 will try to be helpful by automatically ASCII-encoding your Unicode strings, which is fine if they really are ASCII but not if they aren't.

So you need to do req.write(hparser.unescape(text).encode("some-encoding")).
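Put together, the fix might look roughly like the sketch below. It is written for Python 3, where `html.unescape` plays the role of `HTMLParser.unescape`. `FakeRequest` and `write_unescaped` are hypothetical names invented for this example: the class is only a stand-in so the code runs without mod_python, and setting a charset on `content_type` is an assumption about how you would tell the browser which encoding you picked.

```python
import html


class FakeRequest:
    """Hypothetical stand-in for the mod_python request object."""

    def __init__(self):
        self.content_type = None
        self.chunks = []

    def write(self, data):
        # Only accept bytes, so a missing .encode() shows up immediately.
        if not isinstance(data, bytes):
            raise TypeError("write() expects bytes, got %s" % type(data).__name__)
        self.chunks.append(data)


def write_unescaped(req, text, encoding="utf-8"):
    """Unescape character references, then encode explicitly before writing."""
    req.content_type = "text/html; charset=" + encoding
    req.write(html.unescape(text).encode(encoding))


req = FakeRequest()
write_unescaped(req, "&#1084;&#1072;&#1082;")
print(req.content_type)  # text/html; charset=utf-8
print(req.chunks)        # [b'\xd0\xbc\xd0\xb0\xd0\xba']
```

Whatever encoding you choose, the declared charset and the bytes you write need to agree, otherwise the browser will decode them with the wrong mapping.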
