
Which is the correct way to encode escape characters in Python 2 without killing Unicode?

I think I'm going crazy with Python's unicode strings. I'm trying to encode escape characters in a Unicode string without escaping actual Unicode characters. I'm getting this:

In [14]: a = u"Example\n"

In [15]: b = u"Пример\n"

In [16]: print a
Example


In [17]: print b
Пример


In [18]: print a.encode('unicode_escape')
Example\n

In [19]: print b.encode('unicode_escape')
\u041f\u0440\u0438\u043c\u0435\u0440\n

While I desperately need this (the English example already works as I want, obviously):

In [18]: print a.encode('unicode_escape')
Example\n

In [19]: print b.encode('unicode_escape')
Пример\n

What should I do, short of moving to Python 3?

PS: As pointed out below, I'm actually seeking to escape control characters. Whether I need more than just those remains to be seen.

First let's correct the terminology. What you're trying to do is replace "control characters" with an equivalent "escape sequence".

I haven't been able to find any built-in method to do this, and nobody has yet posted one. Fortunately it's not a hard function to write.

control_chars = [unichr(c) for c in range(0x20)] # you may extend this as required

def control_escape(s):
    chars = []
    for c in s:
        if c in control_chars:
            chars.append(c.encode('unicode_escape'))
        else:
            chars.append(c)
    return u''.join(chars)

Or the slightly less readable one-liner version:

def control_escape2(s):
    return u''.join([c.encode('unicode_escape') if c in control_chars else c for c in s])
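As a quick check, here is a minimal sketch of the same idea applied to the two strings from the question. It is a variation rather than the exact code above: it tests `ord(c) < 0x20` instead of building a list, and it decodes the escaped bytes back to text so that it runs unchanged on Python 2.7 and Python 3. Note that, like the functions above, it does not escape the backslash itself, so the result is meant for display and cannot be reversed unambiguously.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def control_escape(s):
    """Escape ASCII control characters (0x00-0x1f); leave all others alone."""
    return ''.join(
        # encode() yields bytes such as b'\\n'; decode them back to text
        c.encode('unicode_escape').decode('ascii') if ord(c) < 0x20 else c
        for c in s
    )


print(control_escape('Example\n'))  # Example\n
print(control_escape('Пример\n'))   # Пример\n
```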

Backslash-escaping ASCII control characters in the middle of Unicode data is definitely a useful thing to try to accomplish. But it's not just about escaping them; it's also about properly unescaping them when you want the actual character data back.

There should be a way to do this in the Python stdlib, but there is not. I filed a bug report: http://bugs.python.org/issue18679

But in the meantime, here's a workaround using translate and some hackery:

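# Map each control code point to its escape sequence. The values must be
# unicode (not str) for unicode.translate to accept them on Python 2.
tm = dict((k, u'' + repr(chr(k))[1:-1]) for k in range(32))
tm[0] = u'\\0'
tm[7] = u'\\a'
tm[8] = u'\\b'
tm[11] = u'\\v'
tm[12] = u'\\f'
tm[ord('\\')] = u'\\\\'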

b = u"Пример\n"
c = b.translate(tm)
print(c) ## results in: Пример\n

All the control characters that have no single-letter backslash escape will be escaped with the \x## sequence, but if you need something different done with those, your translation table can do that. This approach is not lossy though, so it works for me.

But getting it back out is hacky too, because you can't just translate character sequences back into single characters using translate.

d = c.encode('latin1', 'backslashreplace').decode('unicode_escape')
print(d) ## result in Пример with trailing newline character

You actually have to encode the characters that map to bytes individually using latin1, while backslash-escaping the Unicode characters that latin1 doesn't know about, so that the unicode_escape codec can handle reassembling everything the right way.
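To make the round trip concrete, here is a small self-contained sketch that runs on both Python 2.7 and Python 3. The names `escape_table`, `escape_control` and `unescape_control` are chosen for this illustration only. As a deliberate simplification it uses `\xNN` for every control character except tab, newline and carriage return: a short form such as `\0` followed by a digit would be read back as an octal escape by the unicode_escape codec.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# Translation table: code point -> escape sequence (all values are text).
escape_table = dict((k, '\\x%02x' % k) for k in range(32))
escape_table.update({
    ord('\t'): '\\t',
    ord('\n'): '\\n',
    ord('\r'): '\\r',
    ord('\\'): '\\\\',  # escape the backslash so the result is reversible
})


def escape_control(text):
    return text.translate(escape_table)


def unescape_control(text):
    # latin1 + backslashreplace turns every character above U+00FF into a
    # \uXXXX (or \UXXXXXXXX) sequence; unicode_escape then decodes those
    # together with the control-character escapes.
    return text.encode('latin1', 'backslashreplace').decode('unicode_escape')


original = 'Пример\tC:\\temp\n'
escaped = escape_control(original)
print(escaped)  # Пример\tC:\\temp\n
assert unescape_control(escaped) == original
```

Escaping the backslash in the table is what keeps the mapping reversible: a literal backslash followed by `n` in the input stays distinguishable from an escaped newline.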

UPDATE:

So I had a case where I needed this to work in both Python 2.7 and Python 3.3. Here's what I did (buried in a _compat.py module):

if isinstance(b"", str):
    # Python 2: the native str type is a byte string.
    byte_types = (str, bytes, bytearray)
    text_types = (unicode, )
    def uton(x): return x.encode('utf-8', 'surrogateescape')
    def ntob(x): return x
    def ntou(x): return x.decode('utf-8', 'surrogateescape')
    def bton(x): return x
else:
    # Python 3: the native str type is text.
    byte_types = (bytes, bytearray)
    text_types = (str, )
    def uton(x): return x
    def ntob(x): return x.encode('utf-8', 'surrogateescape')
    def ntou(x): return x
    def bton(x): return x.decode('utf-8', 'surrogateescape')

# The values are the escape sequences themselves, so the backslash is doubled.
escape_tm = dict((k, ntou(repr(chr(k))[1:-1])) for k in range(32))
escape_tm[0] = u'\\0'
escape_tm[7] = u'\\a'
escape_tm[8] = u'\\b'
escape_tm[11] = u'\\v'
escape_tm[12] = u'\\f'
escape_tm[ord('\\')] = u'\\\\'

def escape_control(s):
    if isinstance(s, text_types):
        return s.translate(escape_tm)
    else:
        return s.decode('utf-8', 'surrogateescape').translate(escape_tm).encode('utf-8', 'surrogateescape')

def unescape_control(s):
    if isinstance(s, text_types):
        return s.encode('latin1', 'backslashreplace').decode('unicode_escape')
    else:
        return s.decode('utf-8', 'surrogateescape').encode('latin1', 'backslashreplace').decode('unicode_escape').encode('utf-8', 'surrogateescape')

The .encode method returns a byte string (type str in Python 2), so it cannot return Unicode characters.

But as there are only a few backslash escape sequences, you can easily .replace them manually. See http://docs.python.org/reference/lexical_analysis.html#string-literals for a complete list.
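A minimal sketch of that manual approach might look like the following. The `REPLACEMENTS` list and the `escape_by_replace` name are made up for this example, and the list covers only the most common escapes; extend it with whatever sequences you need. It runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# Order matters: handle the backslash first, otherwise the backslashes
# introduced by the later replacements would themselves be doubled.
REPLACEMENTS = [
    ('\\', '\\\\'),
    ('\n', '\\n'),
    ('\r', '\\r'),
    ('\t', '\\t'),
]


def escape_by_replace(text):
    for char, escaped in REPLACEMENTS:
        text = text.replace(char, escaped)
    return text


print(escape_by_replace('Пример\n'))  # Пример\n
```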

.encode('unicode_escape') returns a byte string. You probably want to escape the control characters directly in the Unicode string:

# coding: utf8
import re

def esc(m):
    return u'\\x{:02x}'.format(ord(m.group(0)))

s = u'\r\t\b马克\n'

# Match control characters 0-31.
# The (?s) (DOTALL) flag only affects '.', so it is optional for a character class.
print re.sub(ur'(?s)[\x00-\x1f]',esc,s)

Output:

\x0d\x09\x08马克\x0a

Note there are other Unicode control characters beyond 0-31, so you may need something more like:

# coding: utf8
import re
import unicodedata as ud

def esc(m):
    c = m.group(0)
    if ud.category(c).startswith('C'):
        return u'\\u{:04x}'.format(ord(c))
    return c

s = u'\rMark\t\b马克\n'

# Match ALL characters so the replacement function
# can test the category.  Not very efficient if the string is long.
print re.sub(ur'(?s).',esc,s)

Output:

\u000dMark\u0009\u0008马克\u000a

You may want finer control of what is considered a control character. There are a number of categories. You could build a regular expression matching a specific type with:

import sys
import re
import unicodedata as ud

# Generate a regular expression that matches any Cc category Unicode character.
Cc_CODES = u'(?s)[' + re.escape(u''.join(unichr(n) for n in range(sys.maxunicode+1) if ud.category(unichr(n)) == 'Cc')) + u']'
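For illustration, a pattern built this way could be compiled once and reused with re.sub. The sketch below is an assumption about how you might wire it up, not part of the original snippet: `cc_pattern`, `esc` and the `_unichr` shim are names introduced here so that the code runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re
import sys
import unicodedata as ud

# Python 3 has no unichr; chr already returns text there.
_unichr = chr if sys.version_info[0] >= 3 else unichr

# Character class matching every code point in the Cc (control) category.
cc_pattern = re.compile(
    '[' + re.escape(''.join(
        _unichr(n) for n in range(sys.maxunicode + 1)
        if ud.category(_unichr(n)) == 'Cc')) + ']')


def esc(match):
    """Replace the matched control character with a \\uXXXX sequence."""
    return '\\u{:04x}'.format(ord(match.group(0)))


print(cc_pattern.sub(esc, '\rMark\t\b马克\n'))
# \u000dMark\u0009\u0008马克\u000a
```

Building the class scans every code point once, so it is worth doing at import time rather than on each call.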
