
Decode / encode HTML-escaped special characters in Python

I have some text with HTML escape codes in it that I am struggling to fully decode / encode so that it displays properly with Python (ultimately in a Django application).

"&quot;Coup d'&Atilde;&permil;tat&quot;" is one troublesome snippet.

I have used html.unescape() to successfully unescape most of the HTML codes, but I am struggling with decoding the special characters, "&Atilde;&permil;" in this example. Ideally this would display as "Coup d'État", but despite trying some decoding/encoding combinations I am getting "Coup d'Ãtat".

What is the correct way to convert "&quot;Coup d'&Atilde;&permil;tat&quot;" into "Coup d'État"?

Thanks for your help, and apologies if this has been answered elsewhere. I've tried searching, but with no success.

You have a Mojibake: double-encoded data. Not only do you have HTML entities; your data was also incorrectly decoded from bytes to text before the HTML entities were applied.

For your example, the two entities &Atilde; and &permil; decode to the Unicode characters Ã and ‰. Those two characters are also known (from the Unicode standard) as U+00C3 LATIN CAPITAL LETTER A WITH TILDE and U+2030 PER MILLE SIGN. This is typical of UTF-8 data being misinterpreted as a Latin variant encoding (such as ISO 8859-1 or a Windows Latin codepage variant).

If we assume that the original character was meant to be É, or U+00C9 LATIN CAPITAL LETTER E WITH ACUTE, then the original would have been encoded to the bytes C3 and 89 if using UTF-8. That Ã (U+00C3!) shows up here is not a coincidence; UTF-8 -> Latin variant Mojibakes typically end up with such combinations. The 89 mapping tells us that the most likely candidate for the wrong encoding is the Windows CP 1252 encoding, which maps the hex value 89 to U+2030 PER MILLE SIGN.
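The reasoning above can be checked with a small standard-library sketch. The variable names are illustrative only; the point is to show how É turns into the two entities from the question once its UTF-8 bytes are read as CP-1252.

```python
from html.entities import codepoint2name

original = "É"                              # U+00C9
utf8_bytes = original.encode("utf-8")
print(utf8_bytes)                           # b'\xc3\x89'

# Reading those two bytes as CP-1252 produces the Mojibake pair.
garbled = utf8_bytes.decode("cp1252")
print([f"U+{ord(c):04X}" for c in garbled])  # ['U+00C3', 'U+2030']

# Escaping each character as a named HTML entity gives the text in the question.
entities = "".join(f"&{codepoint2name[ord(c)]};" for c in garbled)
print(entities)                             # &Atilde;&permil;

# ISO 8859-1 would map byte 89 to a C1 control character instead,
# which is why CP-1252 is the more likely culprit here.
latin1 = utf8_bytes.decode("latin-1")
print([f"U+{ord(c):04X}" for c in latin1])   # ['U+00C3', 'U+0089']
```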

You could manually encode to bytes and then decode with the correct encoding, but the trick is to know which encoding was used incorrectly. Sometimes that mistake leads to data loss, because the CP-1252 codepage doesn't have a Unicode character mapping for 5 specific byte values. That's not a direct problem for the example in your question, but it can be for other text. Manually decoding would work like this:

>>> import html
>>> broken = "&quot;Coup d'&Atilde;&permil;tat&quot;"
>>> html.unescape(broken)
'"Coup d\'Ã‰tat"'
>>> html.unescape(broken).encode("cp1252")
b'"Coup d\'\xc3\x89tat"'
>>> html.unescape(broken).encode("cp1252").decode("utf-8")
'"Coup d\'État"'
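If you need this in more than one place, the manual steps could be wrapped in a small helper. The following is only a sketch: `repair_mojibake` and its `wrong_encoding` parameter are hypothetical names, and it assumes that the text was UTF-8 mis-decoded as CP-1252. When the round trip fails (for example because the text was never garbled, or because one of the unmapped byte values is involved), it falls back to the plain unescaped text.

```python
import html

def repair_mojibake(text, wrong_encoding="cp1252"):
    """Unescape HTML entities, then try to undo a UTF-8 -> wrong_encoding mix-up.

    Hypothetical helper: returns the unescaped text unchanged if the
    encode/decode round trip is not possible.
    """
    unescaped = html.unescape(text)
    try:
        return unescaped.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return unescaped

print(repair_mojibake("&quot;Coup d'&Atilde;&permil;tat&quot;"))  # "Coup d'État"
print(repair_mojibake("caf&eacute;"))                             # café (was not garbled)
```

The fallback is a deliberate simplification; it cannot tell undamaged text apart from text garbled by some other codec, which is where a dedicated library helps.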

A better option is to use the ftfy library (the name stands for "fixes text for you", a play on the phrase "fixed that for you"), which uses detailed knowledge about how to recognize such mistakes and undo the damage.

ftfy also handles the HTML-entity decoding, all in one step:

>>> import ftfy
>>> ftfy.fix_text("&quot;Coup d'&Atilde;&permil;tat&quot;")
'"Coup d\'État"'

The library includes sloppy variants of the text codecs often found in a Mojibake to help with repairing. It also encodes information about how to recognize the specific errors that a given wrong codec choice produces, so it knows what to do to reverse the damage.
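To give a rough feel for what a "sloppy" codec buys you, here is a standard-library approximation of the idea. It is not ftfy's implementation; `sloppy_cp1252_encode` is a hypothetical helper, and it assumes that the software which garbled the text passed the five unmapped CP-1252 byte values through as the control characters with the same code point (U+0081 and so on).

```python
# The five byte values with no character assigned in Python's cp1252 codec.
UNDEFINED_CP1252 = {0x81, 0x8D, 0x8F, 0x90, 0x9D}

def sloppy_cp1252_encode(text):
    """Encode to CP-1252, but pass U+0081, U+008D, ... through as raw bytes."""
    out = bytearray()
    for ch in text:
        if ord(ch) in UNDEFINED_CP1252:
            out.append(ord(ch))         # assumed pass-through of the unmapped byte
        else:
            out += ch.encode("cp1252")  # still raises for other unmappable characters
    return bytes(out)

# "Á" is C3 81 in UTF-8; byte 81 has no CP-1252 mapping, so a strict
# garbled.encode("cp1252") would raise UnicodeEncodeError here.
garbled = "\u00c3\u0081"
repaired = sloppy_cp1252_encode(garbled).decode("utf-8")
print(repaired)  # Á
```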
