I would like to use Python to convert UTF-8 special characters (accented letters, etc.) to their extended ASCII equivalents. (Purists are going to say there isn't such a thing, so here is a link to what I mean.)
So basically I want to read in a UTF-8 file and write out an extended ASCII file, something like Latin-1 (I'm using Windows, if that information is needed). I have read all the Unicode blogs and still don't understand a word of it, but I want to preserve as much of the information as possible. So for the UTF-8 character á I would like to write the extended ASCII equivalent á. I don't want to ignore or lose the character, and I don't want to replace it with a plain a. For characters that have no extended ASCII equivalent I would just like to use a character of my choice such as ~, although some characters like ß I would like to convert to ss if there is no ß in extended ASCII.
Is there anything in Python 3 that can do this, or can you give some example code showing how I would do it?
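One possible approach, sketched here under the assumption that "extended ASCII" means Windows-1252 (cp1252): let Python's own encoder do the conversion and register a custom error handler that writes ~ for anything cp1252 cannot represent. The handler name `tilde_replace` and the helper `to_cp1252` are made up for this illustration; `codecs.register_error` is the standard-library hook.

```python
import codecs


def tilde_replace(error):
    """Error handler: write '~' for each character the codec cannot encode."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    # Return the replacement text and the position where encoding resumes.
    return ("~" * len(bad), error.end)


# "tilde_replace" is a name chosen for this sketch, not a built-in handler.
codecs.register_error("tilde_replace", tilde_replace)


def to_cp1252(text):
    """Encode text as cp1252 bytes, using '~' where no equivalent exists."""
    return text.encode("cp1252", errors="tilde_replace")


# U+00E1 (a with acute), U+00DF (sharp s) and U+20AC (euro sign) all exist
# in cp1252; U+4E2D (a CJK character) does not and becomes '~'.
print(to_cp1252("\u00e1 \u00df \u20ac \u4e2d"))  # b'\xe1 \xdf \x80 ~'
```

Once the handler is registered, the same name should also work for files, for example `open(path, 'w', encoding='cp1252', errors='tilde_replace')`. Note that ß exists in both Latin-1 and cp1252 (byte 0xDF), so the ss fallback would only matter for a narrower target encoding.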
Does anyone know of a site that lists the UTF-8 equivalents of the extended ASCII characters?
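Such a list can also be generated locally. The sketch below assumes the target is cp1252 and decodes every byte from 128 to 255 to find the Unicode code point it stands for; `cp1252_table` is a hypothetical helper name. It prints character names rather than the characters themselves so the output stays readable on any console.

```python
import unicodedata


def cp1252_table():
    """Return (byte value, character, Unicode code point) for bytes 128-255."""
    rows = []
    for byte in range(128, 256):
        try:
            char = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            continue  # this byte is unassigned in Python's cp1252 codec
        rows.append((byte, char, ord(char)))
    return rows


for byte, char, codepoint in cp1252_table():
    name = unicodedata.name(char, "(unnamed)")
    print(f"{byte:3d}  U+{codepoint:04X}  {name}")
```

For bytes 160 to 255 the byte value and the code point are the same number; the differences are all in the 128 to 159 range, where cp1252 places characters such as the euro sign and the curly quotes.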
Based on the comments below I have come up with this code, which sadly does not work very well: most of the special characters come out as ? instead of ê (not sure why):
# -*- coding: utf-8 -*-
f_in = open(r'E:/work/python/lyman.txt', 'rU', encoding='utf8')
raw = f_in.read()
f_out = open(r'E:/work/python/lyman_ascii.txt', 'w', encoding='cp1252', errors='replace')
retval = []
for char in raw:
    codepoint = ord(char)
    if codepoint < 0x80: # Basic ASCII
        retval.append(str(char))
        continue
    elif codepoint > 0xeffff:
        continue # Characters in Private Use Area and above are ignored
    # ë
    elif codepoint == 235:
        retval.append(chr(137))
        continue
    # ê
    elif codepoint == 234:
        retval.append(chr(136))
        continue
    # ’
    elif codepoint == 8217:
        retval.append(chr(39)) # 146 gives ? for some reason
        continue
    else:
        print(char)
        print(codepoint)
print(''.join(retval))
f_out.write(''.join(retval))
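A likely explanation for the ? characters, shown as a small sketch: in Python 3, `chr(136)` is not "byte 136 of the output file" but the Unicode control character U+0088, which cp1252 has no byte for, so `errors='replace'` turns it into ?. The letter ê is already code point 234 and encodes fine if it is passed through unchanged. The helper `encode_char` is a made-up name for this illustration.

```python
def encode_char(char):
    """Encode one character as cp1252, using '?' when it cannot be encoded."""
    return char.encode("cp1252", errors="replace")


# chr(136) is the control character U+0088, which cp1252 cannot represent.
print(encode_char(chr(136)))   # b'?'
# The letter itself (U+00EA, decimal 234) encodes to the single byte 0xEA.
print(encode_char("\u00ea"))   # b'\xea'
# Likewise chr(146) is the control character U+0092, not a curly apostrophe.
print(encode_char(chr(146)))   # b'?'
# The curly apostrophe U+2019 (decimal 8217) is what maps to byte 0x92 (146).
print(encode_char("\u2019"))   # b'\x92'
```

In other words, the remapping to 136, 137 and 146 is what causes the loss here; the `encoding='cp1252'` argument on the output file already picks the right byte for each character that cp1252 supports.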
This seems to work:
# -*- coding: utf-8 -*-
# Don't use codecs in Python 3; the built-in open() accepts an encoding.
# The 'U' mode flag was removed in Python 3.11, so plain 'r' is used here.
with open(r'af_massaged.txt', 'r', encoding='utf8') as f_in:
    raw = f_in.read()
retval = []
for char in raw:
    codepoint = ord(char)
    if codepoint < 0x80: # Basic ASCII.
        retval.append(str(char))
        continue
    elif codepoint > 0xeffff:
        continue # Characters in Private Use Area and above are ignored.
    elif codepoint >= 128 and codepoint <= 159:
        continue # Ignore control characters in Latin-1.
    # Don't use unichr in Python 3, chr uses unicode. Get character codes from here: https://en.wikipedia.org/wiki/List_of_Unicode_characters#Latin-1_Supplement
    # This was written on Windows 7 32 bit
    # For 160 to 255 Latin-1 matches unicode.
    elif codepoint >= 160 and codepoint <= 255:
        retval.append(str(char))
        continue
    # –
    elif codepoint == 8211:
        retval.append(chr(45))
        continue
    # ’
    elif codepoint == 8217:
        retval.append(chr(180)) # 39
        continue
    # “
    elif codepoint == 8220:
        retval.append(chr(34))
        continue
    # ”
    elif codepoint == 8221:
        retval.append(chr(34))
        continue
    # €
    elif codepoint == 8364:
        retval.append('Euro')
        continue
    # Find missing mappings.
    else:
        print(char)
        print(codepoint)
# Uncomment for debugging.
#for i in range(128, 256):
#    retval.append(str(i) + ': ' + chr(i) + chr(13))
#print(''.join(retval))
# Open the output only once the text is ready, so the file is closed reliably.
with open(r'af_massaged_ascii.txt', 'w', encoding='cp1252', errors='replace') as f_out:
    f_out.write(''.join(retval))
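The chain of elif branches above can also be written as a lookup table passed to `str.translate`. This is a minimal sketch that reuses the same substitutions; the names `OVERRIDES`, `simplify` and `unmapped` are invented for the example, and anything above code point 255 that is not in the table is reported rather than silently dropped.

```python
# Code point -> replacement text; None deletes the character.
OVERRIDES = {
    0x2013: "-",       # en dash (8211)
    0x2019: "\u00b4",  # right single quote (8217) -> acute accent, chr(180)
    0x201C: '"',       # left double quote (8220)
    0x201D: '"',       # right double quote (8221)
    0x20AC: "Euro",    # euro sign (8364)
}
# Drop the control characters in the 128-159 range, as the loop above does.
OVERRIDES.update({codepoint: None for codepoint in range(128, 160)})


def simplify(text):
    """Apply the substitutions; all other characters pass through unchanged."""
    return text.translate(OVERRIDES)


def unmapped(text):
    """List characters that are still outside the 0-255 range after simplify()."""
    return sorted({char for char in simplify(text) if ord(char) > 255})


sample = "\u201ca\u201d \u2013 b\u2019s \u20ac5 \u00ea"
print(simplify(sample))
print(unmapped("x \u2026 y"))  # the ellipsis U+2026 has no entry in the table
```

Adding a new substitution then means adding one dictionary entry instead of another elif branch.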