How to normalize unicode for iso-8859-15 conversion in Python?

Question

I want to convert a unicode string to iso-8859-15. These strings include the u"\u2019" character (RIGHT SINGLE QUOTATION MARK, see http://www.fileformat.info/info/unicode/char/2019/index.htm ), which is not part of the iso-8859-15 character set.

In Python, how can I normalize the unicode characters so that they match the iso-8859-15 encoding?

I have looked at the unicodedata module without success. I managed to do the job with

s.replace(u"\u2019", "'").encode('iso-8859-15')

but I would like to find a more general and cleaner way.

Thanks for your help

Answer 1

Unless you wish to create a translation rule (if you do, look at Boud's answer), you could choose one of the default error handlers that encode provides, or even register your own:

In [4]: u'\u2019 Hi'.encode('iso-8859-15', 'replace')
Out[4]: '? Hi'

In [5]: u'\u2019 Hi'.encode('iso-8859-15', 'ignore')
Out[5]: ' Hi'

In [6]: u'\u2019 Hi'.encode('iso-8859-15', 'xmlcharrefreplace')
Out[6]: '&#8217; Hi'

From the encode docstring:

S.encode([encoding[,errors]]) -> string or unicode

Encodes S using the codec registered for encoding. encoding defaults to the default encoding. errors may be given to set a different error handling scheme. Default is 'strict' meaning that encoding errors raise a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and 'xmlcharrefreplace' as well as any other name registered with codecs.register_error that can handle UnicodeEncodeErrors.
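
The docstring mentions codecs.register_error but the examples above only use the built-in handlers. Below is a minimal sketch of a custom handler, assuming you want unencodable characters replaced by a close ASCII equivalent and everything else by "?". The names FALLBACKS and ascii_fallback are made up for this illustration, and the sketch is written for Python 3, where encode returns bytes.

```python
import codecs

# Hypothetical fallback table: characters missing from iso-8859-15,
# mapped to close ASCII equivalents. Extend it as needed.
FALLBACKS = {
    u"\u2018": u"'",
    u"\u2019": u"'",
    u"\u201c": u'"',
    u"\u201d": u'"',
}

def ascii_fallback(error):
    """Error handler: substitute each unencodable character."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    # The codec may report a run of several unencodable characters at once.
    bad = error.object[error.start:error.end]
    replacement = u"".join(FALLBACKS.get(ch, u"?") for ch in bad)
    # Return the replacement and the position where encoding should resume.
    return replacement, error.end

# "ascii_fallback" is an arbitrary name chosen for this sketch.
codecs.register_error("ascii_fallback", ascii_fallback)

encoded = u"\u2019 Hi".encode("iso-8859-15", "ascii_fallback")
```

Once registered, the handler name can be passed as the errors argument of any encode call, so the replacement logic lives in one place instead of being repeated before each conversion.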

Answer 2

Use the unicode version of the translate function, assuming s is a unicode string:

s.translate({ord(u"\u2019"):ord(u"'")})

The argument of the unicode version of translate is a dict mapping unicode ordinals to unicode ordinals. Add to this dict other characters you cannot encode in your target encoding.

You can build your mapping table in a slightly more readable form and create your mapping dict from it, for instance:

char_mappings = [(u"\u2019", u"'"),
                 (u"`", u"'")]
translate_mapping = {ord(k):ord(v) for k,v in char_mappings}

From the translate documentation:

For Unicode objects, the translate() method does not accept the optional deletechars argument. Instead, it returns a copy of the s where all characters have been mapped through the given translation table which must be a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None. Unmapped characters are left untouched. Characters mapped to None are deleted. Note, a more flexible approach is to create a custom character mapping codec using the codecs module (see encodings.cp1251 for an example).
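
The documentation quoted above allows table values to be Unicode strings or None, not just ordinals. Here is a small sketch of those two cases. The choice of characters (HORIZONTAL ELLIPSIS and ZERO WIDTH SPACE) and the names table, s and result are only for illustration, and the expected bytes assume Python 3.

```python
# Translation table keyed by Unicode ordinals.
table = {
    0x2019: u"'",    # RIGHT SINGLE QUOTATION MARK -> ASCII apostrophe
    0x2026: u"...",  # HORIZONTAL ELLIPSIS -> a multi-character string
    0x200B: None,    # ZERO WIDTH SPACE -> deleted
}

s = u"It\u2019s done\u2026\u200b"

# After translation every remaining character exists in iso-8859-15,
# so the default 'strict' error handling does not raise.
result = s.translate(table).encode("iso-8859-15")
```

Keeping 'strict' here is deliberate: any character missing from the table still raises UnicodeEncodeError, which points out what to add to the mapping.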

Answer 3

For info, my final solution:

iso885915_utf_map = {
    u"\u2019":  u"'",
    u"\u2018":  u"'",
    u"\u201c":  u'"',
    u"\u201d":  u'"',
}
utf_map = dict([(ord(k), ord(v)) for k,v in iso885915_utf_map.items()])
s.translate(utf_map).encode('iso-8859-15')

Thank you for your help

Answer 1
6 2012-05-28 13:34:42

Answer 2
5 accepted 2012-05-28 13:25:28

Answer 3
3 2012-05-28 14:30:18
