Python - Convert utf8 special characters (accented) to extended ascii equivalent
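I would like to use Python to convert UTF-8 special characters (accented ones and so on) to their extended ASCII equivalents (purists will say there is no such thing, so here is a link to what I mean).
So basically I want to read in a UTF-8 file and write out an extended ASCII file (something like Latin-1; I'm on Windows, if that information is needed). I have read all the Unicode material and so on and don't understand a word of it, but I want to keep as much information as possible. So for the UTF-8 character á I want the extended ASCII equivalent á. I don't want to ignore or lose the character, and I don't want to fall back to a plain a. For characters that have no extended ASCII equivalent, I just want to use a character of my choice, such as ~, although some characters such as ß should be converted to ss if ß does not exist in extended ASCII.
Is there anything in Python 3 that can do this, or can you give some example code showing how I would do it?
Does anyone know of a site that lists the UTF-8 equivalents of the extended ASCII characters?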
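On the question of a lookup table: Python can produce one itself by decoding each byte value with the target codec. The sketch below assumes cp1252 (Windows-1252) is the "extended ASCII" in question; `cp1252_table` is a made-up helper name, not a library function.

```python
import unicodedata


def cp1252_table():
    """Return (byte value, character, Unicode name) for bytes 0x80-0xFF in cp1252."""
    rows = []
    for byte in range(0x80, 0x100):
        try:
            char = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves a few byte values undefined.
            continue
        rows.append((byte, char, unicodedata.name(char, "<unnamed>")))
    return rows


if __name__ == "__main__":
    for byte, char, name in cp1252_table():
        # Print code points rather than the characters so the output is pure ASCII.
        print(f"0x{byte:02X}  U+{ord(char):04X}  {name}")
```

Swapping `"cp1252"` for `"latin-1"` or `"cp437"` lists a different code page the same way.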
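Based on the comments below, I came up with this code. Sadly it does not work very well, because most special characters come back as ? instead of, say, ê (not sure why):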
# -*- coding: utf-8 -*-
# Note: the 'U' mode flag was removed in Python 3.11, so plain 'r' is used.
f_in = open(r'E:/work/python/lyman.txt', 'r', encoding='utf8')
raw = f_in.read()
f_in.close()
f_out = open(r'E:/work/python/lyman_ascii.txt', 'w', encoding='cp1252', errors='replace')
retval = []
for char in raw:
    codepoint = ord(char)
    if codepoint < 0x80:  # Basic ASCII
        retval.append(str(char))
        continue
    elif codepoint > 0xeffff:
        continue  # Characters in Private Use Area and above are ignored
    # ë
    elif codepoint == 235:
        retval.append(chr(137))
        continue
    # ê
    elif codepoint == 234:
        retval.append(chr(136))
        continue
    # ’
    elif codepoint == 8217:
        retval.append(chr(39))  # 146 gives ? for some reason
        continue
    else:
        print(char)
        print(codepoint)
print(''.join(retval))
f_out.write(''.join(retval))
f_out.close()
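A likely reason for the ? characters: `chr()` takes a Unicode code point, not a byte value in the target code page. `chr(137)` is U+0089, a control character that cp1252 cannot encode, so `errors='replace'` turns it into ?. The values 136 and 137 are where ê and ë sit in the old DOS code page 437, which may be where they came from (that part is a guess). A short check, with variable names chosen only for illustration:

```python
# chr() takes a Unicode code point, not a cp1252 byte value.
replaced = chr(137).encode("cp1252", errors="replace")
print(replaced)  # b'?'

# The accented letters are already the right code points; encoding them is enough.
e_diaeresis = "ë".encode("cp1252")
e_circumflex = "ê".encode("cp1252")
right_quote = "’".encode("cp1252")
print(e_diaeresis, e_circumflex, right_quote)  # b'\xeb' b'\xea' b'\x92'

# Going the other way: byte 137 in cp1252 stands for U+2030 (per mille sign).
byte_137 = bytes([137]).decode("cp1252")
print(hex(ord(byte_137)))  # 0x2030

# 136 and 137 are the positions of ê and ë in code page 437.
print(bytes([136, 137]).decode("cp437") == "êë")  # True
```

The same applies to `chr(146)`: it is U+0092, not ’, which would explain the ? noted in the code comment.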
This seems to work:
# -*- coding: utf-8 -*-
# Don't use codecs in Python 3; the built-in open() takes an encoding.
# This was written on Windows 7 32 bit.
# The 'U' mode flag was removed in Python 3.11, so plain 'r' is used here.
with open(r'af_massaged.txt', 'r', encoding='utf8') as f_in:
    raw = f_in.read()

retval = []
for char in raw:
    codepoint = ord(char)
    if codepoint < 0x80:  # Basic ASCII.
        retval.append(char)
        continue
    elif codepoint > 0xeffff:
        continue  # Characters in Private Use Area and above are ignored.
    elif codepoint >= 128 and codepoint <= 159:
        continue  # Ignore control characters in Latin-1.
    # Don't use unichr in Python 3, chr uses unicode. Get character codes from here: https://en.wikipedia.org/wiki/List_of_Unicode_characters#Latin-1_Supplement
    # For 160 to 255 Latin-1 matches unicode.
    elif codepoint >= 160 and codepoint <= 255:
        retval.append(char)
        continue
    # –
    elif codepoint == 8211:
        retval.append(chr(45))  # hyphen-minus
        continue
    # ’
    elif codepoint == 8217:
        retval.append(chr(180))  # acute accent; chr(39) would give a plain apostrophe
        continue
    # “
    elif codepoint == 8220:
        retval.append(chr(34))  # straight double quote
        continue
    # ”
    elif codepoint == 8221:
        retval.append(chr(34))  # straight double quote
        continue
    # €
    elif codepoint == 8364:
        retval.append('Euro')
        continue
    # Find missing mappings.
    else:
        print(char)
        print(codepoint)

# Uncomment for debugging.
#for i in range(128, 256):
#    retval.append(str(i) + ': ' + chr(i) + chr(13))
#print(''.join(retval))

with open(r'af_massaged_ascii.txt', 'w', encoding='cp1252', errors='replace') as f_out:
    f_out.write(''.join(retval))
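A table-driven variant of the same idea is to let the codec do the direct mappings and register a custom error handler for everything it cannot encode. This is only a sketch: the `FALLBACKS` table, the handler name `fallback_or_tilde` and the function `to_latin1` are all made up for illustration, and Latin-1 is assumed as the target. Because the handler only runs for characters the codec rejects, ß stays ß (Latin-1 has it), and anything without an entry in the table becomes ~.

```python
import codecs

# Hypothetical fallback table; extend it with whatever substitutions are wanted.
FALLBACKS = {"–": "-", "’": "'", "“": '"', "”": '"', "€": "Euro"}


def fallback_errors(exc):
    """Codec error handler: use FALLBACKS where possible, otherwise '~'."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]  # the run of unencodable characters
    replacement = "".join(FALLBACKS.get(ch, "~") for ch in bad)
    return replacement, exc.end  # resume encoding after the bad run


codecs.register_error("fallback_or_tilde", fallback_errors)


def to_latin1(text):
    """Encode text as Latin-1, substituting characters Latin-1 lacks."""
    return text.encode("latin-1", errors="fallback_or_tilde")


print(to_latin1("á ß"))               # b'\xe1 \xdf'
print(to_latin1("“ê” – 5€ 日本"))      # b'"\xea" - 5Euro ~~'
```

Once registered, the handler name can also be passed to `open()`, for example `open(path, 'w', encoding='latin-1', errors='fallback_or_tilde')`, so the file can be written without building the output list by hand.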