Python's re module has no [\p{Ll}\p{Lo}] character classes, and I'm struggling to write a regular expression that recognizes Unicode letters without also matching punctuation such as '-' or producing stray diacritics when the script encounters an accented or non-Latin character (like 'ô' or 'طس').
My goal is to label ALL letters (ASCII and any Unicode) as an "A" and every digit [0-9] as a "9".
My current function is:
def multiple_replace(myString):
    myString = re.sub(r'(?u)[^\W\d_]|-','A', myString)
    myString = re.sub(r'[0-9]', '9', myString)
    return myString
The results I am getting are (notice the inconsistency in how '-' is labeled: sometimes as an 'A', sometimes as an 'Aœ'):
TX 35-L | AA 99AA
М-21 | AAœA99
A 1 طس | A 9 A~˜A·A~AA
US-50 | AAA99
yeni sinop-erfelek yolu çevre yolu | AAAA AAAAAAAAAAAAA AAAA AƒA§AAAA AAAA
Av Antônio Ribeiro | AA AAAAƒA´AAA AAAAAAA
What I need to get is this:
TX 35-L | AA 99-A
М-21 | A-99
A 1 طس | A 9 AAAAA
US-50 | AA-99
yeni sinop-erfelek yolu çevre yolu | AAAA AAAAAAAAAAAAA AAAA AAAAAAAA AAAA
Av Antônio Ribeiro | AA AAAAAAAAAA AAAAAAA
...is it even possible (with the Python 2.7 re module) to match ALL Unicode characters that are NOT common punctuation marks (i.e. '()', ',', '.', '-', etc.) and NOT digits, without [\p{Ll}\p{Lo}]?
If using Python 2.7, use Unicode strings. I'm assuming your "What I need" examples are incorrect, or do you really want AAAAA for طس? If reading the strings from a file, decode the strings to Unicode first. Note that the pattern below also drops the |- alternative, which is what was turning '-' into 'A'.
#!python2
#coding: utf8
import re
# Note leading u
data = u'TX 35-L|М-21|A 1 طس|US-50|yeni sinop-erfelek yolu çevre yolu|Av Antônio Ribeiro'.split('|')
for d in data:
    r = re.sub(ur'(?u)[^\W\d_]',u'A', d)
    r = re.sub(ur'[0-9]', u'9', r)
    print d
    print r
    print
Output:
TX 35-L
AA 99-A
М-21
A-99
A 1 طس
A 9 AA
US-50
AA-99
yeni sinop-erfelek yolu çevre yolu
AAAA AAAAA-AAAAAAA AAAA AAAAA AAAA
Av Antônio Ribeiro
AA AAAAAAA AAAAAAA
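To illustrate why decoding first matters, here is a minimal sketch that labels the same UTF-8 bytes twice: once decoded as UTF-8, and once decoded one byte per character (Latin-1), which roughly approximates what the pattern sees when it runs over an undecoded byte string in Python 2. The names label, LETTER, and DIGIT are made up for this example, and the code is written with u'' literals so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Letters only: word characters minus digits and underscore.
LETTER = re.compile(u'[^\\W\\d_]', re.UNICODE)
DIGIT = re.compile(u'[0-9]')

def label(text):
    """Replace each letter with 'A' and each ASCII digit with '9'."""
    return DIGIT.sub(u'9', LETTER.sub(u'A', text))

# Raw UTF-8 bytes, as they would arrive from a file read in binary mode.
raw = u'Av Antônio Ribeiro'.encode('utf-8')

# Decoded properly: 'ô' is a single code point and becomes a single 'A'.
good = label(raw.decode('utf-8'))

# Decoded byte by byte: 'ô' is two bytes (0xC3 0xB4). The first looks like
# a letter and becomes 'A'; the second is not a letter and is left behind
# as a stray character in the output.
bad = label(raw.decode('latin-1'))
```

Here good is 'AA AAAAAAA AAAAAAA', while bad is one character longer and still contains the leftover second byte of 'ô', which is the kind of debris shown in the question's output.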
Not sure why my answer just got deleted, but here is what I went with:
function (regex):
def multiple_replace(myString):
    myString = re.sub(ur'(?u)[^\W\d_]', u'A', myString)
    myString = re.sub(ur'[0-9]', u'9', myString)
    return myString
call (w/ decoding):
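with codecs.open(r'test5.txt', 'w', 'utf-8') as outfile1:
    for row in reader:
        # Decode every cell from UTF-8 bytes to unicode before processing
        unicode_row = [x.decode('utf-8') for x in row]
        item = unicode_row[csv_col_index]
        # Write only decoded values: mixing a byte string (row[1]) with unicode
        # triggers an implicit ASCII decode that fails on non-ASCII data
        outfile1.write(unicode_row[1] + "," + item + "," + multiple_replace(item) + "\n")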
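For anyone doing the same thing on Python 3 rather than 2.7, a rough equivalent might look like the sketch below. It assumes Python 3, where str is already Unicode and the csv module yields decoded text, so the explicit .decode() step disappears. The function name label_column and the in-memory sample data are invented for illustration; with real files you would pass objects opened with open(path, encoding='utf-8', newline='').

```python
import csv
import io
import re

# On Python 3, str patterns are Unicode-aware by default (no (?u) needed).
LETTER = re.compile(r'[^\W\d_]')
DIGIT = re.compile(r'[0-9]')

def multiple_replace(text):
    """Replace each letter with 'A' and each ASCII digit with '9'."""
    return DIGIT.sub('9', LETTER.sub('A', text))

def label_column(infile, outfile, col_index):
    """Copy each CSV row and append a labeled version of one column.

    label_column and col_index are hypothetical names for this sketch.
    """
    writer = csv.writer(outfile, lineterminator='\n')
    for row in csv.reader(infile):
        writer.writerow(row + [multiple_replace(row[col_index])])

# In-memory stand-ins for the input and output files.
src = io.StringIO('1,US-50\n2,Av Antônio Ribeiro\n')
dst = io.StringIO()
label_column(src, dst, 1)
result = dst.getvalue()
```

Using csv.writer instead of manual string concatenation also takes care of quoting if a cell happens to contain a comma.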