Finding contiguous ranges in a list of Unicode code points

Question

I have a list of Unicode code points, something along these lines (not an actual set, just an illustration of the problem):

uni050B
uni050C
uni050D
uni050E
uni050F
uni0510
uni0511
uni0512
uni0513
uni1E00
uni1E01
uni1E3E
uni1E3F
uni1E80
uni1E81
uni1E82
uni1E83
uni1E84
uni1E85
uni1EA0
and so forth…

I need to find the unicode-range for these. Some parts of this set are continuous, but some points are missing, so the range is not simply U+050B-1EA0. Is there a sensible way of extracting those continuous "sub-ranges"?

Answer 1

I don't know of anything off-the-shelf, but it is easy enough to calculate. The code below finds runs of consecutive numbers and builds a unicode-range using Python:

import re

def build_range(uni):
    '''Pass a list of sorted positive integers to include in the unicode-range.
    '''
    uni = uni + [-1] # sentinel avoids special-casing the last element; copying leaves the caller's list unmodified
    start,uni = uni[0],uni[1:]
    current = start

    strings = []
    for u in uni:
        if u == current: # in case of duplicates
            continue
        if u == current + 1: # in a consecutive range...
            current = u
        elif start == current: # single element
            strings.append(f'U+{current:X}')
            start = current = u
        else: # range
            strings.append(f'U+{start:X}-{current:X}')
            start = current = u
        
    return 'unicode-range: ' + ', '.join(strings) + ';'

data = '''\
uni050B
uni050C
uni050D
uni050E
uni050F
uni0510
uni0511
uni0512
uni0513
uni1E00
uni1E01
uni1E3E
uni1E3F
uni1E80
uni1E81
uni1E82
uni1E83
uni1E84
uni1E85
uni1EA0'''

# parse out the hexadecimal values into an integer list
uni = sorted([int(x,16) for x in re.findall(r'uni([0-9A-F]{4})',data)])

print(build_range(uni))

Output:

unicode-range: U+50B-513, U+1E00-1E01, U+1E3E-1E3F, U+1E80-1E85, U+1EA0;
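
For comparison, here is a minimal sketch of the same idea built on itertools.groupby instead of a sentinel. It relies on the fact that, in a sorted list without duplicates, the difference between a value and its index stays constant within a run of consecutive integers. The names contiguous_runs and format_unicode_range are hypothetical helpers made up for this illustration, not part of the answer above or of any library.

```python
from itertools import groupby

# Hypothetical helper names, chosen for illustration only.
def contiguous_runs(code_points):
    """Return (first, last) pairs, one per run of consecutive integers."""
    values = sorted(set(code_points))  # sort and drop duplicates; input is left untouched
    runs = []
    # Inside a run of consecutive integers, value - index is constant,
    # so grouping on that difference splits the list into runs.
    for _, group in groupby(enumerate(values), key=lambda pair: pair[1] - pair[0]):
        members = [value for _, value in group]
        runs.append((members[0], members[-1]))
    return runs

def format_unicode_range(runs):
    """Format (first, last) pairs as a CSS unicode-range declaration."""
    parts = [f'U+{a:X}' if a == b else f'U+{a:X}-{b:X}' for a, b in runs]
    return 'unicode-range: ' + ', '.join(parts) + ';'

sample = [0x50B, 0x50C, 0x50D, 0x1E00, 0x1E01, 0x1EA0]
print(format_unicode_range(contiguous_runs(sample)))
# prints: unicode-range: U+50B-50D, U+1E00-1E01, U+1EA0;
```

Keeping the run detection separate from the formatting means the (first, last) pairs can be reused for something other than CSS, and an empty input simply yields an empty list of runs.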
