Python: convert UTF-8 special characters (accented) to their extended ASCII equivalents

Question

I would like to use Python to convert UTF-8 special characters (accented, etc.) to their extended ASCII equivalents (purists are going to say there isn't such a thing, so here is a link to what I mean).

So basically I want to read in a UTF-8 file and write out an extended ASCII file, something like Latin-1 (I'm using Windows, if that information is needed). I have read all the Unicode blogs and still don't understand a word of it, but I want to preserve as much of the information as possible. So for the UTF-8 character á I would like to convert it to the extended ASCII equivalent á. I don't want to ignore or lose the character, and I don't want to use a plain a. For characters that have no extended ASCII equivalent I would just like to use a character of my choice such as ~, although some characters like ß I would like to convert to ss if there is no ß in extended ASCII.

Is there anything in Python 3 that can do this, or can you give some example code showing how I would do this?

Does anyone know of any site that lists the UTF-8 equivalents for the extended ASCII characters?

Based on the comments below I have come up with this code, which sadly does not work very well, since most of the special characters are returned as ? instead of ê (not sure why):

# -*- coding: utf-8 -*-

f_in = open(r'E:/work/python/lyman.txt', 'rU', encoding='utf8')
raw = f_in.read()

f_out = open(r'E:/work/python/lyman_ascii.txt', 'w', encoding='cp1252', errors='replace')

retval = []
for char in raw:
    codepoint = ord(char)
    if codepoint < 0x80: # Basic ASCII
        retval.append(str(char))
        continue
    elif codepoint > 0xeffff:
        continue # Characters in Private Use Area and above are ignored
    # ë
    elif codepoint == 235:
        retval.append(chr(137))
        continue
    # ê
    elif codepoint == 234:
        retval.append(chr(136))
        continue
    # ’
    elif codepoint == 8217:
        retval.append(chr(39)) # 146 gives ? for some reason
        continue
    else:
        print(char)
        print(codepoint)

print(''.join(retval))
f_out.write(''.join(retval))
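
On the ? output: a likely cause, assuming Python's standard cp1252 codec, is that chr(136) and chr(137) produce the Unicode code points U+0088 and U+0089 (C1 control characters), not the cp1252 bytes 136 and 137. cp1252 has no byte for those code points, so errors='replace' writes ? instead. The check below is a minimal sketch of that; the variable names are only for illustration.

```python
# chr(136) is U+0088, a C1 control character with no cp1252 byte,
# so errors='replace' substitutes b'?'.
wrong = chr(136).encode('cp1252', errors='replace')

# The Unicode code point of ê (234, U+00EA) already encodes to byte 0xEA,
# so the character can be passed through unchanged.
right = '\u00ea'.encode('cp1252')

# ’ (U+2019) is the character that cp1252 stores as byte 146 (0x92);
# chr(146) is a different character (U+0092) and is not encodable.
quote = '\u2019'.encode('cp1252')

print(wrong, right, quote)
```

In other words, the mapping step is probably unnecessary for these characters: appending char itself and letting the cp1252 encoder pick the byte should be enough.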

Answer 1

This seems to work: 这似乎可行：

# -*- coding: utf-8 -*-
import sys

# Don't use codecs in Python 3.
f_in = open(r'af_massaged.txt', 'r', encoding='utf8')  # The 'U' mode flag was removed in Python 3.11.
raw = f_in.read()

f_out = open(r'af_massaged_ascii.txt', 'w', encoding='cp1252', errors='replace')

retval = []
for char in raw:
    codepoint = ord(char)
    if codepoint < 0x80:    # Basic ASCII.
        retval.append(str(char))
        continue
    elif codepoint > 0xeffff:
        continue    # Characters in Private Use Area and above are ignored.
    elif codepoint >= 128 and codepoint <= 159:
        continue    # Ignore control characters in Latin-1.
    # Don't use unichr in Python 3, chr uses unicode. Get character codes from here: https://en.wikipedia.org/wiki/List_of_Unicode_characters#Latin-1_Supplement
    # This was written on Windows 7 32 bit
    # For 160 to 255 Latin-1 matches unicode.
    elif codepoint >= 160 and codepoint <= 255:
        retval.append(str(char))
        continue
    # –
    elif codepoint == 8211:
        retval.append(chr(45))
        continue
    # ’
    elif codepoint == 8217:
        retval.append(chr(180)) # 39
        continue
    # “
    elif codepoint == 8220:
        retval.append(chr(34))
        continue
    # ”
    elif codepoint == 8221:
        retval.append(chr(34))
        continue
    # €
    elif codepoint == 8364:
        retval.append('Euro')
        continue
    # Find missing mappings.
    else:
        print(char)
        print(codepoint)

# Uncomment for debugging.
#for i in range(128, 256):
#    retval.append(str(i) + ': ' + chr(i) + chr(13))

#print(''.join(retval))
f_out.write(''.join(retval))

# Close both files so the output is flushed to disk.
f_in.close()
f_out.close()
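
A shorter alternative to the per-character loop, offered as a sketch rather than a drop-in replacement: let the cp1252 codec do the mapping and register a custom error handler for whatever it cannot encode. The names tilde_fallback, FALLBACKS and to_cp1252 are made up for this example, and the entries in FALLBACKS are only sample choices. The NFC normalization step is an extra assumption: it helps when the input stores á as a plus a combining accent. Note that ß itself exists in cp1252, so no ss fallback is needed for it.

```python
import codecs
import unicodedata

# Hypothetical fallback table for characters cp1252 cannot represent.
FALLBACKS = {
    '\u1e9e': 'SS',  # capital sharp s, not in cp1252
    '\u2212': '-',   # minus sign
}

def tilde_fallback(error):
    """Encode-error handler: use FALLBACKS where possible, '~' otherwise."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    replacement = ''.join(FALLBACKS.get(ch, '~') for ch in bad)
    # Return the replacement text and the position to resume encoding at.
    return replacement, error.end

codecs.register_error('tilde_fallback', tilde_fallback)

def to_cp1252(text):
    """Encode text as cp1252, keeping every character the codec supports."""
    # Compose 'a' + combining accent into a single code point first.
    text = unicodedata.normalize('NFC', text)
    return text.encode('cp1252', errors='tilde_fallback')

print(to_cp1252('\u00e1 \u00df \u20ac \u4e2d \u2212'))
```

Once registered, the handler name can also be passed to open, e.g. open(path, 'w', encoding='cp1252', errors='tilde_fallback'), so the file object applies the same fallbacks on write.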

Answer 1: 0, accepted, 2018-06-15 11:16:51
