
Generate UTF-8 character list

I have a UTF-8 file which I convert to ISO-8859-1 before sending it to a consuming system that does not understand UTF-8. Our current issue is that when we run the iconv process on the UTF-8 file, some characters get converted to '?'. Currently we have been providing a fix for every failing character as it turns up. I am trying to understand whether it is possible to create a file which contains all possible UTF-8 characters. The intent is to downgrade them using iconv and identify the characters that get replaced with '?'.

Rather than looking at every possible Unicode character (over 140k of them), I recommend performing an iconv substitution and then seeing where your actual problems are. For example:

iconv -f UTF-8 -t ISO-8859-1 --unicode-subst="<U+%04X>"

This will convert characters that aren't in ISO-8859-1 to a "<U+####>" syntax. You can then search your output for these.
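If you want a quick inventory of what was substituted, a small sketch along these lines (not part of the original answer; it assumes the converted ISO-8859-1 output is piped in on stdin) lists each distinct <U+####> marker together with its code point name:

import re
import sys
from unicodedata import name

# Read the ISO-8859-1 output of iconv and collect every <U+XXXX> marker
# that --unicode-subst emitted for characters it could not map.
text = sys.stdin.buffer.read().decode('latin-1')
markers = re.findall(r'<U\+([0-9A-Fa-f]{4,})>', text)

for hexcode in sorted(set(markers), key=lambda h: int(h, 16)):
    codepoint = int(hexcode, 16)
    print('U+%04X  %s  %s' % (codepoint, chr(codepoint), name(chr(codepoint), 'UNKNOWN')))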

If your data will be read by something that handles C-style escapes (\u####), you can also use:

iconv -f UTF-8 -t ISO-8859-1 --unicode-subst="\\u%04x"
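As a rough sketch (again not from the original answer), Python's unicode_escape codec understands these \u#### escapes, so you can decode a converted file back for inspection, provided the data contains no other backslash sequences; the file name converted.txt here is just an assumption:

# Read the ISO-8859-1 output that contains \uXXXX substitutions and
# decode the escapes back to the original characters for comparison.
with open('converted.txt', encoding='latin-1') as fh:
    restored = fh.read().encode('latin-1').decode('unicode_escape')
print(restored)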

An exhaustive list of all Unicode characters seems rather impractical for this use case. There are tens of thousands of characters in non-Latin scripts which don't have any obvious near-equivalent in Latin-1.

Instead, probably look for a mapping from Latin characters which are not in Latin-1 to corresponding homographs or near-equivalents.

Some programming languages have existing libraries for this; a common and simple transformation is to attempt to strip any accents from characters which cannot be represented in Latin-1, and use the unaccented variant if this works. (You'll want to keep the accent for any character which can be normalized to Latin-1, though. Maybe also read about Unicode normalization.)

Here's a quick and dirty Python attempt.

import sys
from unicodedata import normalize

def latinize(string):
    """
    Map string to Latin-1, replacing characters which can be approximated
    """
    result = []
    for char in string:
        try:
            # Compose (NFKC) and try a straight Latin-1 encoding first
            byte = normalize("NFKC", char).encode('latin-1')
        except UnicodeEncodeError:
            # Otherwise decompose (NFKD) and drop anything non-ASCII,
            # which effectively strips accents
            byte = normalize("NFKD", char).encode('ascii', 'ignore')
        result.append(byte)
    return b''.join(result)

def convert(fh):
    for line in fh:
        # Write raw bytes; print() would emit the bytes repr (b'...')
        sys.stdout.buffer.write(latinize(line))

def main():
    if len(sys.argv) > 1:
        for filename in sys.argv[1:]:
            with open(filename, 'r', encoding='utf-8') as fh:
                convert(fh)
    else:
        convert(sys.stdin)

if __name__ == '__main__':
    main()
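A minimal usage sketch for the script above, assuming it is saved as latinize.py (the file name is only for illustration): run python3 latinize.py input.txt > output.txt to convert a file, or call the function directly to see the behaviour:

# Hypothetical interactive check of latinize() defined above:
print(latinize('ﬁle čaj café Ω'))
# b'file caj caf\xe9 '   (ligature expanded, caron stripped, é kept, Ω dropped)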

Demo: https://ideone.com/sOEBW9
