
Is there a library for Python that gives the script name for a given unicode character or string?

Is there a library that tells what script a particular unicode character belongs to?

For example, for the input u'ሕ' it should return "Ethiopic" or similar.

Maybe the data in the unicodedata module is what you are looking for:

import unicodedata
print(unicodedata.name(u"ሕ"))

prints

ETHIOPIC SYLLABLE HHE

The printed name can be used to look up the corresponding character:

unicodedata.lookup("ETHIOPIC SYLLABLE HHE")
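As a rough illustration of how far the character name alone gets you, the sketch below takes the first word of the name as a guess at the script. The helper name `rough_script_guess` is made up for this example, and the result is only a heuristic: it is not the Unicode Script property, and for characters such as digits or punctuation the first word of the name is not a script at all.

```python
import unicodedata

def rough_script_guess(char):
    """Guess a script from the first word of the character's Unicode name.

    Hypothetical helper: a heuristic only, not the real Script property.
    """
    # The second argument is returned when the character has no name.
    name = unicodedata.name(char, None)
    if name is None:
        return "Unknown"
    return name.split()[0]

print(rough_script_guess(u"ሕ"))  # ETHIOPIC
print(rough_script_guess(u"A"))  # LATIN
print(rough_script_guess(u"Д"))  # CYRILLIC

# The name round-trips back to the character through lookup().
print(unicodedata.lookup("ETHIOPIC SYLLABLE HHE") == u"ሕ")  # True
```

For anything beyond a quick guess, the Script property itself is the safer source.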

You can parse the Scripts.txt file from the Unicode Character Database, which maps code point ranges to script names:

# -*- coding: utf-8; -*-

import bisect

script_file = "/path/to/Scripts.txt"
scripts = []

with open(script_file, "rt") as stream:
    for line in stream:
        line = line.split("#", 1)[0].strip()
        if line:
            rng, script = line.split(";", 1)
            elems = rng.split("..", 1)
            start = int(elems[0], 16)
            if len(elems) == 2:
                stop = int(elems[1], 16)
            else:
                stop = start
            scripts.append((start, stop, script.lstrip()))

scripts.sort()
indices = [elem[0] for elem in scripts]

def find_script(char):
    if not isinstance(char, int):
        char = ord(char)
    index = bisect.bisect(indices, char) - 1
    start, stop, script = scripts[index]
    if start <= char <= stop:
        return script
    else:
        return "Unknown"

print(find_script(u'A'))
print(find_script(u'Д'))
print(find_script(u'ሕ'))
print(find_script(0x1000))
print(find_script(0xE007F))
print(find_script(0xE0080))

Note that this code is neither robust nor optimized. You should check that the argument denotes a valid character or code point, and you should coalesce adjacent ranges that share the same script.
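One possible way to address both points is sketched below. The names `coalesce` and `make_finder` are invented for this example, and the `sample` rows are a few hand-written ranges in the same (start, stop, script) shape as the parsed list, not the real contents of Scripts.txt. It assumes Python 3.

```python
import bisect

def coalesce(ranges):
    """Merge ranges that are directly adjacent and share the same script."""
    merged = []
    for start, stop, script in sorted(ranges):
        if merged and merged[-1][2] == script and merged[-1][1] + 1 == start:
            # Extend the previous range instead of adding a new entry.
            merged[-1] = (merged[-1][0], stop, script)
        else:
            merged.append((start, stop, script))
    return merged

def make_finder(ranges):
    """Build a lookup function over the coalesced ranges."""
    table = coalesce(ranges)
    starts = [entry[0] for entry in table]

    def find_script(char):
        # Accept either a code point (int) or a single-character string.
        code = char if isinstance(char, int) else ord(char)
        if not 0 <= code <= 0x10FFFF:
            raise ValueError("not a Unicode code point: %r" % (char,))
        index = bisect.bisect(starts, code) - 1
        if index >= 0:
            start, stop, script = table[index]
            if code <= stop:
                return script
        return "Unknown"

    return find_script

# Hand-written sample rows for illustration only.
sample = [
    (0x0020, 0x0020, "Common"),
    (0x0021, 0x0023, "Common"),
    (0x0024, 0x0024, "Common"),
    (0x0041, 0x005A, "Latin"),
]

find_script = make_finder(sample)
print(coalesce(sample))
print(find_script(u"A"))
print(find_script(0x22))
```

With the sample rows, the three adjacent "Common" entries collapse into a single range, so the bisect table is shorter. A multi-character string raises TypeError from `ord`, and an integer outside 0..0x10FFFF raises ValueError instead of silently returning "Unknown".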
