Writing a Python regular expression that recognizes all Unicode letters

Question

There is no [\p{Ll}\p{Lo}] in Python, and I'm struggling to write a regular expression that recognizes Unicode letters without treating punctuation such as '-' as a letter or adding odd diacritics when the script encounters an accented or non-Latin character (like 'ô' or 'طس').

My goal is to replace ALL letters (ASCII and any Unicode) with an "A", and a digit [1-9] with a 9.

My current function is:

import re

def multiple_replace(myString):
    myString = re.sub(r'(?u)[^\W\d_]|-','A', myString)
    myString = re.sub(r'[0-9]', '9', myString)
    return myString

The results I am getting are (notice the inconsistency in how '-' is labeled: sometimes as an 'A', sometimes as an 'Aœ'):

TX 35-L | AA 99AA
М-21 | AAœA99
A 1 طس | A 9 A~˜A·A~AA
US-50 | AAA99
yeni sinop-erfelek yolu çevre yolu | AAAA AAAAAAAAAAAAA AAAA AƒA§AAAA AAAA
Av Antônio Ribeiro | AA AAAAƒA´AAA AAAAAAA

What I need to get is this:

TX 35-L | AA 99-A
М-21 | A-99
A 1 طس | A 9 AAAAA
US-50 | AA-99
yeni sinop-erfelek yolu çevre yolu | AAAA AAAAAAAAAAAAA AAAA AAAAAAAA AAAA
Av Antônio Ribeiro | AA AAAAAAAAAA AAAAAAA

...is it even possible (with Python 2.7's re module) to identify ALL UTF-8 characters that are NOT common punctuation marks (i.e. '()', ',', '.', '-', etc.) and NOT digits 1-9, without [\p{Ll}\p{Lo}]?

Answer 1

If using Python 2.7, use Unicode strings. I'm assuming your "What I need" examples are incorrect, or do you really want AAAAA for طس? If reading the strings from a file, decode the strings to Unicode first.
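To illustrate why decoding matters, here is a minimal sketch. It is written for Python 3 (an assumption, since the question targets 2.7), and the names LETTER, raw, wrong and right are made up for the example. It takes the UTF-8 bytes of a word and decodes them two ways: once byte by byte as Latin-1, which roughly imitates what happens when undecoded bytes reach the regex, and once properly as UTF-8.

```python
import re

# Letters only: word characters minus digits and underscore
LETTER = re.compile(r'[^\W\d_]')

raw = 'Antônio'.encode('utf-8')   # 8 bytes: 'ô' takes two bytes in UTF-8
wrong = raw.decode('latin-1')     # every byte becomes its own character
right = raw.decode('utf-8')       # 'ô' is a single character again

masked_wrong = LETTER.sub('A', wrong)   # a stray accent character survives
masked_right = LETTER.sub('A', right)   # one 'A' per letter

print(masked_wrong)
print(masked_right)
```

In the mis-decoded string the two bytes of 'ô' show up as a letter followed by a non-letter symbol, so only one of them is replaced. That looks like the kind of leftover debris shown in the question's output.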

#!python2
#coding: utf8
import re

# Note leading u
data = u'TX 35-L|М-21|A 1 طس|US-50|yeni sinop-erfelek yolu çevre yolu|Av Antônio Ribeiro'.split('|')

for d in data:
    r = re.sub(ur'(?u)[^\W\d_]',u'A', d)
    r = re.sub(ur'[0-9]', u'9', r)
    print d
    print r
    print

Output:

TX 35-L
AA 99-A

М-21
A-99

A 1 طس
A 9 AA

US-50
AA-99

yeni sinop-erfelek yolu çevre yolu
AAAA AAAAA-AAAAAAA AAAA AAAAA AAAA

Av Antônio Ribeiro
AA AAAAAAA AAAAAAA
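For readers on Python 3, where str is already Unicode, the same idea could be sketched as below. The class [^\W\d_] works because \w covers letters, digits and underscore, so removing digits and underscore leaves letters. The second function, multiple_replace_nfc, is my own addition and not part of the answer above: it assumes the input may contain decomposed characters (a base letter followed by a combining mark), which the letter class would otherwise leave half-replaced.

```python
import re
import unicodedata

def multiple_replace(text):
    # Replace every Unicode letter with 'A', then every ASCII digit with '9'
    text = re.sub(r'[^\W\d_]', 'A', text)
    return re.sub(r'[0-9]', '9', text)

def multiple_replace_nfc(text):
    # Assumption, not from the answer: compose base letter + combining mark
    # into a single code point before masking
    return multiple_replace(unicodedata.normalize('NFC', text))

samples = ['TX 35-L', 'М-21', 'A 1 طس', 'US-50', 'Av Antônio Ribeiro']
masked = [multiple_replace(s) for s in samples]

# 'o' followed by COMBINING CIRCUMFLEX ACCENT instead of a precomposed 'ô'
decomposed = 'Anto\u0302nio'
masked_plain = multiple_replace(decomposed)     # combining mark is left behind
masked_nfc = multiple_replace_nfc(decomposed)   # 'AAAAAAA'
```

Whether the normalization step is needed depends on where the data comes from. If the strings are already precomposed, the two functions give the same result.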

Answer 2

Not sure why my answer just got deleted, but here is what I went with:

Function (regex):

import re

def multiple_replace(myString):
    myString = re.sub(ur'(?u)[^\W\d_]', u'A', myString)
    myString = re.sub(ur'[0-9]', u'9', myString)
    return myString

Call (with decoding):

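import codecs

# Assumes reader is a csv.reader over the input file (rows of UTF-8 byte strings)
# and csv_col_index is the index of the column to mask
with codecs.open(r'test5.txt', 'w', 'utf-8') as outfile1: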
    for row in reader:
        unicode_row = [x.decode('utf-8') for x in row]
        item = unicode_row[csv_col_index]
        outfile1.write(row[1] + "," + item + "," + multiple_replace(item) + "\n")
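A rough Python 3 counterpart of this read, mask and write loop might look like the sketch below. Everything about the data is hypothetical: the input is an in-memory stand-in for a UTF-8 CSV file, the column layout is assumed, and csv.writer is used for the output instead of joining strings with commas.

```python
import csv
import io
import re

def multiple_replace(text):
    # Letters become 'A', ASCII digits become '9'
    text = re.sub(r'[^\W\d_]', 'A', text)
    return re.sub(r'[0-9]', '9', text)

# Hypothetical input standing in for a UTF-8 encoded CSV file
source = io.StringIO('1,TX 35-L\n2,М-21\n3,Av Antônio Ribeiro\n')
csv_col_index = 1   # assumed: the column that holds the text to mask

out = io.StringIO()
writer = csv.writer(out, lineterminator='\n')
for row in csv.reader(source):
    item = row[csv_col_index]
    # Write an identifier column, the original text, and the masked text
    writer.writerow([row[0], item, multiple_replace(item)])

result = out.getvalue()
```

With real files, opening them as open(path, encoding='utf-8', newline='') means the rows arrive as Unicode strings, so no per-cell decode call is needed.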

Answer 1: score 2, posted 2013-11-22 07:30:32
Answer 2: score -2, accepted, posted 2013-11-25 03:24:04