简体   繁体   中英

How can I determine a Unicode character from its name in Python, even if that character is a control character?

I'd like to create an array of the Unicode code points which constitute white space in JavaScript (minus the Unicode-white-space code points, which I address separately). These characters are horizontal tab, vertical tab, form feed, space, non-breaking space, and BOM. I could do this with magic numbers:

whitespace = [0x9, 0xb, 0xc, 0x20, 0xa0, 0xfeff]

That's a little bit obscure; names would be better. The unicodedata.lookup method passed through ord helps some:

>>> ord(unicodedata.lookup("NO-BREAK SPACE"))
160

But this doesn't work for 0x9, 0xb, or 0xc -- I think because they're control characters, and the "names" FORM FEED and such are just alias names. Is there any way to map these "names" to the characters, or their code points, in standard Python? Or am I out of luck?

Kerrek SB's comment is a good one: just put the names in a comment.

BTW, Python also supports a named unicode literal:

>>> u"\N{NO-BREAK SPACE}"
u'\xa0'

But it uses the same unicode name database, and the control characters are not in it.

You could roll your own "database" for the control characters by parsing a few lines of the UCD files in the Unicode public directory . In particular, see the UnicodeData-6.1.0d3 file (or see the parent directory for earlier versions).

I don't think it can be done in standard Python. The unicodedata module uses the UnicodeData.txt v5.2.0 Unicode database. Notice that the control characters are all assigned the name <control> (the second field, semicolon-delimited).

The script Tools/unicode/makeunicodedata.py in the Python source distribution is used to generate the table used by the Python runtime. The makeunicodename function looks like this:

def makeunicodename(unicode, trace):

    FILE = "Modules/unicodename_db.h"

    print "--- Preparing", FILE, "..."

    # collect names
    names = [None] * len(unicode.chars)

    for char in unicode.chars:
        record = unicode.table[char]
        if record:
            name = record[1].strip()
            if name and name[0] != "<":
                names[char] = name + chr(0)
    ...

Notice that it skips over entries whose name begins with "<" . Hence, there is no name that can be passed to unicodedata.lookup that will give you back one of those control characters.

Just hardcode the code points for horizontal tab, line feed, and carriage return, and leave a descriptive comment. As the Zen of Python goes, "practicality beats purity".

A few points:

(1) "BOM" is not a character. BOM is a byte sequence that appears at the start of a file to indicate the byte order of a file that is encoded in UTF-nn. BOM is u'\'.encode('UTF-nn'). Reading a file with the appropriate codec will slurp up the BOM; you don't see it as a Unicode character. A BOM is not data. If you do see u'\' in your data, treat it as a (deprecated) ZERO-WIDTH NO-BREAK SPACE.

(2) "minus the Unicode-white-space code points, which I address separately"?? Isn't NO-BREAK SPACE a "Unicode-white-space" code point?

(3) Your Python appears to be broken; mine does this:

>>> ord(unicodedata.lookup("NO-BREAK SPACE"))
160

(4) You could use escape sequences for the first three.

>>> map(hex, map(ord, "\t\v\f"))
['0x9', '0xb', '0xc']

(5) You could use " " for the fourth one.

(6) Even if you could use names, the readers of your code would still be applying blind faith that eg "FORM FEED" is a whitespace character.

(7) What happened to to \\r and \\n ?

Assuming you're working with Unicode strings, the first five items in your list, plus all other Unicode space characters, will be matched by the \\s option when using a regular expression. Using Python 3.1.2:

>>> import re
>>> s = '\u0009,\u000b,\u000c,\u0020,\u00a0,\ufeff'
>>> s
'\t,\x0b,\x0c, ,\xa0,\ufeff'
>>> re.findall(r'\s', s)
['\t', '\x0b', '\x0c', ' ', '\xa0']

And as for the byte-order mark, the one given can be referred to as codecs.BOM_BE or codecs.BOM_UTF16_BE (though in Python 3+, it's returned as a bytes object rather than str ).

The official Unicode recommendation for newlines may or may not be at odds with the way the Python codecs module handles newlines. Since u'\\n' is often said to mean "new line", one might expect based on this recommendation for the Python string u'\\n' to represent character U+2028 LINE SEPARATOR and to be encoded as such, rather than as the semantic-less control character U+000A . But I can only imagine the confusion that would result if the codecs module actually implemented that policy, and there are valid counter-arguments besides. Ditto for horizontal/vertical tab and form feed, which are probably not really characters but controls anyway. (I would certainly consider backspace to be a control, not a character.)

Your question seems to assume that treating U+000A as a control character (instead of a line separator) is wrong; but that is not at all certain. Perhaps it is more wrong for text processing applications everywhere to assume that a legacy printer-platen-scrolling control signal is really a true "line separator".

You can extend the lookup function to handle the characters that aren't included.

def unicode_lookup(x):
    try:
        ch = unicodedata.lookup(x)
    except KeyError:
        control_chars = {'LINE FEED':unichr(0x0a),'CARRIAGE RETURN':unichr(0x0d)}
        if x in control_chars:
            ch = control_chars[x]
        else:
            raise
    return ch

>>> unicode_lookup('SPACE')
u' '
>>> unicode_lookup('LINE FEED')
u'\n'
>>> unicode_lookup('FORM FEED')

Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    unicode_lookup('FORM FEED')
  File "<pyshell#13>", line 3, in unicode_lookup
    ch = unicodedata.lookup(x)
KeyError: "undefined character name 'FORM FEED'"

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM