
Python ascii utf unicode


When I parse this XML with p = xml.parsers.expat.ParserCreate():

<name>Fortuna D&#252;sseldorf</name>

The character data event handler receives u'\xfc'.

How can u'\xfc' be turned into u'ü'?


This is the main question in this post; the rest just shows further (ranting) thoughts about it.

Isn't Python's unicode handling broken, since u'\xfc' should yield u'ü' and nothing else? u'\xfc' is already a unicode string, so converting it to unicode again doesn't work! Converting it to ASCII doesn't work either.

The only thing that I found works is this (this cannot be intended, right?):

exec( 'print u\'' + 'Fortuna D\xfcsseldorf'.decode('8859') + u'\'')

Replacing 8859 with utf-8 fails! What is the point of that?

Also, what is the point of the Python Unicode HOWTO? It only gives examples of failures instead of showing how to do the conversions that people (especially the hundreds who ask similar questions here) actually use in real-world practice.

Unicode is no magic. Why do so many people here have issues with it?

The underlying problem of unicode conversion is dirt simple:

One bidirectional lookup table: '\xFC' <-> u'ü'

unicode( 'Fortuna D\xfcsseldorf' ) 

Why do the creators of Python think it is better to raise an error instead of simply producing this: u'Fortuna Düsseldorf'?

Also, why did they make it non-reversible?

 >>> u'Fortuna Düsseldorf'.encode('utf-8')
 'Fortuna D\xc3\xbcsseldorf'
 >>> unicode('Fortuna D\xc3\xbcsseldorf','utf-8')
 u'Fortuna D\xfcsseldorf'    

You already have the value. Python simply tries to make debugging easier by giving you a representation that is ASCII friendly. Echoing a value in the interpreter shows you the result of calling repr() on it.

In other words, you are confusing the representation of the value with the value itself. The representation is designed to be safely copied and pasted around, without worrying about how other systems might handle non-ASCII codepoints. To that end, Python string literal syntax is used, with any non-printable and non-ASCII characters replaced by \xhh and \uhhhh escape sequences. Pasting those strings back into a Python string or an interactive Python session will reproduce the exact same value.
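To make the distinction concrete, here is a minimal sketch. It is written in Python 3 syntax, which is an assumption on my part since the question uses Python 2; the same idea holds there for unicode objects. The names chunks, text and escaped are mine, and the handler simply collects whatever expat passes in. It suggests that the parsed text already is the string containing ü, and that the escaped form is only how that value gets displayed.

```python
import ast
import xml.parsers.expat

chunks = []  # expat may deliver character data in several pieces

parser = xml.parsers.expat.ParserCreate()
parser.CharacterDataHandler = chunks.append
parser.Parse('<name>Fortuna D&#252;sseldorf</name>', True)

# Join the pieces; the value itself already contains the letter ü.
text = ''.join(chunks)

# ascii() builds the escaped, ASCII-only representation,
# which is what Python 2's repr() displayed.
escaped = ascii(text)
print(escaped)

# Reading that representation back as a literal rebuilds the same value.
assert ast.literal_eval(escaped) == text
```

Nothing needs to be "converted" here: the escaping only exists in the printed representation, not in the string the handler received.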

So ü has been replaced by \xfc, because U+00FC is the Unicode codepoint for LATIN SMALL LETTER U WITH DIAERESIS.
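A quick way to check this yourself, sketched in Python 3 syntax (an assumption; in Python 2 you would add the u prefix to both literals). The variable names are only for illustration. It compares a literal written with the escape against one written with the character, then looks up the codepoint and its official name.

```python
import unicodedata

escaped_literal = '\xfc'  # written with an escape sequence
direct_literal = 'ü'      # written with the character itself

# Both literals produce the same one-character string.
same_value = escaped_literal == direct_literal
codepoint = ord(direct_literal)
name = unicodedata.name(direct_literal)

print(same_value)
print(hex(codepoint))
print(name)
```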

If your terminal is configured correctly, you can just use print, and Python will encode the Unicode value with your terminal's codec, so the terminal displays the non-ASCII glyphs:

>>> u'Fortuna Düsseldorf'
u'Fortuna D\xfcsseldorf'
>>> print u'Fortuna Düsseldorf'
Fortuna Düsseldorf

If your terminal is configured for UTF-8, you can also write the UTF-8 bytes directly to your terminal, after encoding explicitly:

>>> u'Fortuna Düsseldorf'.encode('utf8')
'Fortuna D\xc3\xbcsseldorf'
>>> print u'Fortuna Düsseldorf'.encode('utf8')
Fortuna Düsseldorf
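Encoding and decoding are reversible as long as the same codec is used in both directions. The sketch below is in Python 3 syntax (an assumption; the thread is Python 2) and the variable names are mine. It also hints at why decoding the byte string 'Fortuna D\xfcsseldorf' with 8859 (an alias for Latin-1) worked in the question while utf-8 failed: a single FC byte is how Latin-1 encodes ü, but it is not valid UTF-8.

```python
original = 'Fortuna Düsseldorf'

utf8_bytes = original.encode('utf-8')      # ü becomes two bytes: C3 BC
latin1_bytes = original.encode('latin-1')  # ü becomes one byte: FC

# Decoding with the codec used for encoding restores the original string.
assert utf8_bytes.decode('utf-8') == original
assert latin1_bytes.decode('latin-1') == original

# Decoding with the wrong codec does not: a lone FC byte is not valid UTF-8.
try:
    latin1_bytes.decode('utf-8')
    mismatch_failed = False
except UnicodeDecodeError:
    mismatch_failed = True

print(utf8_bytes)
print(latin1_bytes)
print(mismatch_failed)
```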

The alternative is to upgrade to Python 3; there, repr() only uses escape sequences for codepoints that have no printable glyphs (control codes, reserved codepoints, surrogates, etc.; if the codepoint is not a space but falls in a C* or Z* general category, it is escaped). The new ascii() function still gives you the Python 2 repr() behaviour.
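As a rough illustration of that difference in Python 3, the sketch below runs repr() and ascii() over three sample characters. The sample set and the dictionary names are my own choices, not something from the question.

```python
samples = {
    'letter': 'ü',   # printable, so repr() shows it as-is
    'bell': '\x07',  # control code (general category Cc)
    'nbsp': '\xa0',  # no-break space (category Zs, not the ASCII space)
}

# repr() escapes only the non-printable codepoints;
# ascii() escapes everything outside ASCII as well.
reprs = {key: repr(value) for key, value in samples.items()}
asciis = {key: ascii(value) for key, value in samples.items()}

for key in samples:
    print(key, asciis[key])
```

Here repr() leaves ü alone but escapes the bell and the no-break space, while ascii() escapes all three.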

