
python-re: How do I match an alpha character?

How can I match an alpha character with a regular expression? I want a character that is in \w but not in \d. I want it to be Unicode compatible, which is why I cannot use [a-zA-Z].

Your first two sentences contradict each other: "in \w but not in \d" includes the underscore. I'm assuming from your third sentence that you don't want the underscore.

Sketching a Venn diagram on the back of an envelope helps. Let's look at what we DON'T want:

(1) characters that are not matched by \w (i.e. anything that is not a letter, digit, or underscore) => \W
(2) digits => \d
(3) underscore => _

So what we don't want is anything in the character class [\W\d_], and consequently what we do want is anything in the character class [^\W\d_].

Here's a simple example (Python 2.6):

>>> import re
>>> rx = re.compile("[^\W\d_]+", re.UNICODE)
>>> rx.findall(u"abc_def,k9")
[u'abc', u'def', u'k']
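
On Python 3 the same character class works without the u prefix or re.UNICODE, since str patterns are Unicode-aware by default. A minimal sketch, assuming Python 3 (the name letters_rx is only for illustration):

```python
import re

# Letters only: anything matched by \w, minus digits (\d) and the underscore.
letters_rx = re.compile(r"[^\W\d_]+")

result = letters_rx.findall("abc_def,k9")
print(result)  # ['abc', 'def', 'k']
```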

Further exploration reveals a few quirks of this approach:

>>> import unicodedata as ucd
>>> allsorts =u"\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"
>>> for x in allsorts:
...     print repr(x), ucd.category(x), ucd.name(x)
...
u'\u0473' Ll CYRILLIC SMALL LETTER FITA
u'\u0660' Nd ARABIC-INDIC DIGIT ZERO
u'\u06c9' Lo ARABIC LETTER KIRGHIZ YU
u'\u24e8' So CIRCLED LATIN SMALL LETTER Y
u'\u4e0a' Lo CJK UNIFIED IDEOGRAPH-4E0A
u'\u3020' So POSTAL MARK FACE
u'\u3021' Nl HANGZHOU NUMERAL ONE
>>> rx.findall(allsorts)
[u'\u0473', u'\u06c9', u'\u4e0a', u'\u3021']

U+3021 (HANGZHOU NUMERAL ONE) is treated as numeric (hence it matches \w), but it appears that Python interprets "digit" to mean "decimal digit" (category Nd), so it doesn't match \d and therefore slips through [^\W\d_].

U+24E8 (CIRCLED LATIN SMALL LETTER Y) doesn't match \w.

All CJK ideographs are classed as "letters" and thus match \w.

Whether or not any of the above 3 points are a concern, this approach is the best you will get out of the re module as currently released. Syntax like \p{letter} is in the future.
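
If the U+3021 quirk matters, one possible workaround is to filter by Unicode general category instead of relying on the regex. The sketch below assumes Python 3; is_letter is a hypothetical helper written for this example, not part of re or unicodedata:

```python
import unicodedata

def is_letter(ch):
    """Return True if ch is in a Unicode letter category (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")

# Same sample characters as in the session above.
allsorts = "\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"
letters = [ch for ch in allsorts if is_letter(ch)]
print([hex(ord(ch)) for ch in letters])  # ['0x473', '0x6c9', '0x4e0a']
```

Unlike [^\W\d_], this drops U+3021, because its category is Nl rather than one of the letter categories.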

What about:

\p{L}

You can use this document as a reference: Unicode Regular Expressions

EDIT: It seems Python doesn't handle Unicode expressions. Take a look at this link: Handling Accented Characters with Python Regular Expressions -- [A-Z] just isn't good enough (no longer active; link to the Internet Archive)
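
Property classes such as \p{L} are, as far as I know, supported by the third-party regex package on PyPI, though not by the standard re module. A hedged sketch, assuming Python 3: it falls back to the [^\W\d_] class when regex is not installed, and find_letter_runs is a name made up for this example:

```python
import re

try:
    import regex  # third-party package (pip install regex); optional here
except ImportError:
    regex = None

def find_letter_runs(text):
    """Return runs of letters, using \\p{L} when the regex package is available."""
    if regex is not None:
        return regex.findall(r"\p{L}+", text)
    # Fallback: close to \p{L}, but also admits characters such as U+3021.
    return re.findall(r"[^\W\d_]+", text)

runs = find_letter_runs("abc_def,k9")
print(runs)  # ['abc', 'def', 'k']
```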


For posterity, here are the examples from the blog:

import re
string = 'riché'
print string
riché

richre = re.compile('([A-z]+)')
match = richre.match(string)
print match.groups()
('rich',)

richre = re.compile('(\w+)',re.LOCALE)
match = richre.match(string)
print match.groups()
('rich',)

richre = re.compile('([é\w]+)')
match = richre.match(string)
print match.groups()
('rich\xe9',)

richre = re.compile('([\xe9\w]+)')
match = richre.match(string)
print match.groups()
('rich\xe9',)

richre = re.compile('([\xe9-\xf8\w]+)')
match = richre.match(string)
print match.groups()
('rich\xe9',)

string = 'richéñ'
match = richre.match(string)
print match.groups()
('rich\xe9\xf1',)

richre = re.compile('([\u00E9-\u00F8\w]+)')
print match.groups()
('rich\xe9\xf1',)

matched = match.group(1)
print matched
richéñ
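
The blog examples above are Python 2 byte-string code. As a rough sketch of how the same match behaves on Python 3, where str patterns are Unicode-aware, \w already covers the accented letters without an explicit range:

```python
import re

text = "rich\u00e9\u00f1"  # 'rich' followed by e-acute and n-tilde

# \w is Unicode-aware for str patterns on Python 3.
word = re.match(r"(\w+)", text).group(1)
print(word == text)  # True

# An ASCII-only class stops at the first accented character.
ascii_only = re.match(r"([A-Za-z]+)", text).group(1)
print(ascii_only)  # rich
```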

You can use one of the following expressions to match a single letter:

(?![\d_])\w

or

\w(?<![\d_])

Here I match \w, but check that the same character is not matched by [\d_], either with a lookahead before consuming it or with a lookbehind after.
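
A short sketch of both variants, assuming Python 3 (the sample string is invented for illustration):

```python
import re

sample = "a1_b\u00e9"  # letters, a digit, an underscore, and an accented letter

# Lookahead: reject a digit or underscore before \w consumes the character.
ahead = re.findall(r"(?![\d_])\w", sample)

# Lookbehind: let \w consume the character, then reject it
# if what was just consumed is a digit or underscore.
behind = re.findall(r"\w(?<![\d_])", sample)

print(ahead == behind)  # True
print(len(ahead))       # 3
```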

From the docs:

(?!...)
Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.

(?<!...)
Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length and shouldn’t contain group references. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.
