
Generate UTF-8 character list

I have a UTF-8 file which I convert to ISO-8859-1 before sending it to a consuming system that does not understand UTF-8. Our current issue is that when we run the iconv process on the UTF-8 file, some characters get converted to '?'. Currently we have been providing a fix for every failing character as it turns up. I am trying to understand whether it is possible to create a file which contains all possible UTF-8 characters. The intent is to downgrade them using iconv and identify the characters that get replaced with '?'.

Rather than looking at every possible Unicode character (over 140k of them), I recommend performing an iconv substitution and then seeing where your actual problems are. For example:

iconv -f UTF-8 -t ISO-8859-1 --unicode-subst="<U+%04X>"

This will convert characters that aren't in ISO-8859-1 to a "<U+####>" syntax. You can then search your output for these.
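If you want a quick inventory of what was substituted, a small sketch along these lines (not part of the original answer; it assumes the converted ISO-8859-1 output is piped in on stdin) lists each distinct <U+####> marker together with its code point name:

import re
import sys
from unicodedata import name

# Read the ISO-8859-1 output of iconv and collect every <U+XXXX> marker
# that --unicode-subst emitted for characters it could not map.
text = sys.stdin.buffer.read().decode('latin-1')
markers = re.findall(r'<U\+([0-9A-Fa-f]{4,})>', text)

for hexcode in sorted(set(markers), key=lambda h: int(h, 16)):
    codepoint = int(hexcode, 16)
    print('U+%04X  %s  %s' % (codepoint, chr(codepoint), name(chr(codepoint), 'UNKNOWN')))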

If your data will be read by something that handles C-style escapes (\u####), you can also use:

iconv -f UTF-8 -t ISO-8859-1 --unicode-subst="\\u%04x"
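As a rough sketch (again not from the original answer), Python's unicode_escape codec understands these \u#### escapes, so you can decode a converted file back for inspection, provided the data contains no other backslash sequences; the file name converted.txt here is just an assumption:

# Read the ISO-8859-1 output that contains \uXXXX substitutions and
# decode the escapes back to the original characters for comparison.
with open('converted.txt', encoding='latin-1') as fh:
    restored = fh.read().encode('latin-1').decode('unicode_escape')
print(restored)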

An exhaustive list of all Unicode characters seems rather impractical for this use case. There are tens of thousands of characters in non-Latin scripts which don't have any obvious near-equivalent in Latin-1.

Instead, probably look for a mapping from Latin characters which are not in Latin-1 to corresponding homographs or near-equivalents.

Some programming languages have existing libraries for this; a common and simple transformation is to attempt to strip any accents from characters which cannot be represented in Latin-1, and use the unaccented variant if this works. (You'll want to keep the accent for any character which can be normalized to Latin-1, though. Maybe also read about Unicode normalization.)

Here's a quick and dirty Python attempt.

import sys
from unicodedata import normalize

def latinize(string):
    """
    Map string to Latin-1, replacing characters which can be approximated
    """
    result = []
    for char in string:
        try:
            # Compose (NFKC) and try a straight Latin-1 encoding first
            byte = normalize("NFKC", char).encode('latin-1')
        except UnicodeEncodeError:
            # Otherwise decompose (NFKD) and drop anything non-ASCII,
            # which effectively strips accents
            byte = normalize("NFKD", char).encode('ascii', 'ignore')
        result.append(byte)
    return b''.join(result)

def convert(fh):
    for line in fh:
        # Write raw bytes; print() would emit the bytes repr (b'...')
        sys.stdout.buffer.write(latinize(line))

def main():
    if len(sys.argv) > 1:
        for filename in sys.argv[1:]:
            with open(filename, 'r', encoding='utf-8') as fh:
                convert(fh)
    else:
        convert(sys.stdin)

if __name__ == '__main__':
    main()
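A minimal usage sketch for the script above, assuming it is saved as latinize.py (the file name is only for illustration): run python3 latinize.py input.txt > output.txt to convert a file, or call the function directly to see the behaviour:

# Hypothetical interactive check of latinize() defined above:
print(latinize('ﬁle čaj café Ω'))
# b'file caj caf\xe9 '   (ligature expanded, caron stripped, é kept, Ω dropped)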

Demo: https://ideone.com/sOEBW9
