
Generate UTF-8 character list

I have a UTF-8 file which I convert to ISO-8859-1 before sending it to a consuming system that does not understand UTF-8. Our current issue is that when we run iconv on the UTF-8 file, some characters get converted to '?'. So far we have been providing a fix for every failing character as it turns up. I am trying to understand whether it is possible to create a file containing all possible UTF-8 characters, so that we can downgrade it with iconv and identify up front which characters get replaced with '?'.

Rather than looking at every possible Unicode character (over 140k of them), I recommend performing an iconv substitution and then seeing where your actual problems are. For example:

iconv -f UTF-8 -t ISO-8859-1 --unicode-subst="<U+%04X>"

This replaces every character that isn't in ISO-8859-1 with a "<U+####>" marker, which you can then search for in the output.
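If you need the same substitution from inside a program rather than the iconv CLI, Python's codecs module lets you register a custom encode-error handler. A minimal sketch (the handler name 'uplus' and the function are my own, not a standard library feature) that mimics the "<U+####>" marker:

```python
import codecs

def uplus_subst(err):
    """Replace each unencodable character with a <U+XXXX> marker."""
    bad = err.object[err.start:err.end]
    # Return the replacement string plus the position to resume encoding at
    return ''.join('<U+%04X>' % ord(c) for c in bad), err.end

codecs.register_error('uplus', uplus_subst)

text = 'caf\u00e9 \u20ac5'  # 'café €5'; the euro sign is not in Latin-1
print(text.encode('latin-1', errors='uplus'))  # b'caf\xe9 <U+20AC>5'
```

As with the iconv flag, anything that survives the encode unchanged was representable; the markers pinpoint exactly which code points your consuming system would have lost.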

If your data will be read by something that handles C-style escapes (\\u####), you can also use:

iconv -f UTF-8 -t ISO-8859-1 --unicode-subst="\\u%04x"
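Python can produce this \u#### style out of the box: the built-in 'backslashreplace' error handler escapes anything the target codec cannot represent (note it emits lowercase hex):

```python
text = 'caf\u00e9 \u20ac5'  # 'café €5'

# Characters outside Latin-1 become literal \uXXXX escapes in the output bytes;
# representable characters like é pass through as their Latin-1 byte
data = text.encode('latin-1', errors='backslashreplace')
print(data)  # b'caf\xe9 \\u20ac5'
```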

An exhaustive list of all Unicode characters seems rather impractical for this use case. There are tens of thousands of characters in non-Latin scripts which don't have any obvious near-equivalent in Latin-1.

Instead, look for a mapping from Latin characters which are not in Latin-1 to corresponding homographs or near-equivalents.

Some programming languages have existing libraries for this; a common and simple transformation is to attempt to strip the accents from characters which cannot be represented in Latin-1, and use the unaccented variant if that succeeds. (You'll want to keep the accent for any character which can be represented in Latin-1, though. You may also want to read about Unicode normalization.)
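A quick illustration of the two normalization forms involved, using only the standard library's unicodedata:

```python
from unicodedata import normalize

# NFKC folds compatibility characters: the 'ﬁ' ligature (U+FB01)
# becomes plain 'fi', which Latin-1 can encode directly
assert normalize('NFKC', '\ufb01').encode('latin-1') == b'fi'

# 'Ş' (U+015E) is outside Latin-1; NFKD splits it into 'S' plus a
# combining cedilla, so an 'ignore' encode keeps the base letter
assert normalize('NFKD', '\u015e').encode('ascii', 'ignore') == b'S'

# But 'é' (U+00E9) IS in Latin-1, so it keeps its accent
assert normalize('NFKC', '\u00e9').encode('latin-1') == b'\xe9'
```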

Here's a quick and dirty Python attempt.

import sys
from unicodedata import normalize

def latinize(string):
    """
    Map string to Latin-1 bytes, approximating characters where possible.

    Try NFKC first (this rescues compatibility characters such as
    ligatures); for anything still outside Latin-1, fall back to NFKD
    and drop the combining accents. Characters with no approximation
    at all (e.g. CJK) are silently dropped by the 'ignore' handler.
    """
    result = []
    for char in string:
        try:
            byte = normalize("NFKC", char).encode('latin-1')
        except UnicodeEncodeError:
            byte = normalize("NFKD", char).encode('ascii', 'ignore')
        result.append(byte)
    return b''.join(result)

def convert(fh):
    for line in fh:
        # Write the raw Latin-1 bytes; print() would emit the bytes repr
        sys.stdout.buffer.write(latinize(line))

def main():
    if len(sys.argv) > 1:
        for filename in sys.argv[1:]:
            with open(filename, 'r', encoding='utf-8') as fh:
                convert(fh)
    else:
        convert(sys.stdin)

if __name__ == '__main__':
    main()

Demo: https://ideone.com/sOEBW9
