简体   繁体   中英

Convert full-width Unicode characters into ASCII characters

I have some string text in unicode, containing some numbers as below:

txt = '36fsdfdsf14'

However, int(txt[:2]) does not recognize the characters as number. How to change the characters to have them recognized as number?

If you actually have Unicode (or decode your byte string to Unicode) then you can normalize the data with a canonical replacement:

>>> s = u'36fsdfdsf14'
>>> s
u'\uff13\uff16fsdfdsf\uff11\uff14'
>>> import unicodedata as ud
>>> ud.normalize('NFKC',s)
u'36fsdfdsf14'

If canonical normalization changes too much for you, you can make a translation table of just the replacements you want:

#coding:utf8

repl = u'0123456789'

# Fullwidth digits are U+FF10 to U+FF19.
# This makes a lookup table from Unicode ordinal to the ASCII character equivalent.
xlat = dict(zip(range(0xff10,0xff1a),repl))

s = u'36fsdfdsf14'

print(s.translate(xlat))

Output:

36fsdfdsf14

On python 3

[int(x) for x in re.findall(r'\d+', '36fsdfdsf14')]
# [36, 14]

On python 2

[int(x) for x in re.findall(r'\d+', u'36fsdfdsf14', re.U)]
# [36, 14]

About python 2 example, notice the 'u' in front of string and re.U flag. You may convert existing str typed variable such as txt in your question to unicode as txt.decode('utf8') .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM