
regEx: To match two groups of chars

I want a regEx that matches text containing both alphabetic and numeric chars, but does NOT match text that is only alphabetic or only numeric. E.g. in Python:

s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
#               ^^^^^^^^ <- I want something that'll only match this part.
import re
rr = re.compile('([0-9a-z]{8})')
print('sub=', rr.sub('########', s))
print('findall=', rr.findall(s))

This generates the following output:

sub= [########: ########]: STARTED at ########ng job number ########
findall= ['mytaskid', '3fee46d2', 'processi', '10022001']

I want it to be:

sub= [mytaskid: ########]: STARTED at processing job number 10022001
findall= ['3fee46d2']

Any ideas? In this case it's always exactly 8 chars, but it would be even better to have a regEx that doesn't have {8} in it, i.e. one that still matches if there are more or fewer than 8 chars.

-- edit --

The question is more about understanding whether there is a way to write a regEx that combines 2 patterns (in this case [0-9] and [a-z]) and ensures the matched string matches both patterns, while the number of chars matched from each set is variable. E.g. s could also be

s = 'mytaskid 3fee46d2 STARTED processing job number 10022001'

-- answer --

Thanks to all for the answers. All of them give me what I want, so everyone gets a +1 and the first one to answer gets the accepted answer, although jerry explains it the best. :)

For anyone who is a stickler for performance: there is nothing to choose between them, the timings are all about the same.

s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
#               ^^^^^^^^ <- I want something that'll only match this part.
import re
from timeit import timeit

def testIt(regEx):
    s = '[mytaskid: 3333fe46d2]: STARTED at processing job number 10022001'
    # Check the pattern under test (not a hard-coded one) before timing it.
    assert (re.sub(regEx, '########', s) ==
            '[mytaskid: ########]: STARTED at processing job number 10022001'), '"%s" does not work.' % regEx
    # r'%s' keeps \b a word boundary; in a plain '%s' literal it would turn into a backspace.
    setup = '''
import re
s = '%s'
rr = re.compile(r'%s')
''' % (s, regEx)
    print('sub() with \'', regEx, '\': ', timeit('rr.sub(\'########\', s)', number=500000, setup=setup))
    print('findall() with \'', regEx, '\': ', timeit('rr.findall(s)', setup=setup))

testIt(r'\b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b')
testIt(r'\b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b')
testIt(r'\b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\b')
testIt(r'\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b')

produced:

sub() with ' \b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b ':  0.328042736387
findall() with ' \b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b ':  0.350668751542
sub() with ' \b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b ':  0.314759661193
findall() with ' \b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b ':  0.35618526928
sub() with ' \b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\b ':  0.322802906619
findall() with ' \b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\b ':  0.35330467656
sub() with ' \b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b ':  0.320779061371
findall() with ' \b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b ':  0.347522144274
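Timings aside, it may be worth confirming that the four patterns really return the same matches. The following is only a sketch: `check_patterns`, `PATTERNS` and `SAMPLES` are names made up for this illustration, and the two sample strings are the ones from the question.

```python
import re

# The four patterns benchmarked above, written as raw strings.
PATTERNS = [
    r'\b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b',
    r'\b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b',
    r'\b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\b',
    r'\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b',
]

# Both sample inputs from the question (with and without brackets).
SAMPLES = [
    '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001',
    'mytaskid 3fee46d2 STARTED processing job number 10022001',
]

def check_patterns():
    """Map each pattern to its findall() result for every sample string."""
    return {p: [re.findall(p, s) for s in SAMPLES] for p in PATTERNS}

for pattern, results in check_patterns().items():
    print(pattern, '->', results)
```

On these two inputs every pattern should report only '3fee46d2'; other inputs (uppercase ids, underscores) may need the patterns adjusted.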

Try the following regex:

\b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b

This will match a word containing a digit followed by a letter, or vice versa.

Any word that contains at least one digit and at least one letter must have such an adjacent pair somewhere, so this covers the complete set of those words.
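The claim that "has an adjacent digit/letter pair" is the same as "has at least one digit and one letter" can be checked by brute force. This is just a sketch under assumptions: `has_both` and `pair_regex_agrees` are hypothetical helpers, the `\b` anchors are dropped because `fullmatch` is used, and only short words over a tiny alphabet are tried.

```python
import re
from itertools import product

# Same pattern as above, without \b because fullmatch() anchors both ends.
PAIR_RE = re.compile(r'[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*')

def has_both(word):
    """Plain-Python definition: at least one digit and at least one letter."""
    return any(c.isdigit() for c in word) and any(c.isalpha() for c in word)

def pair_regex_agrees(alphabet='ab12', max_len=5):
    """Compare the regex with has_both() on every word up to max_len chars."""
    for n in range(1, max_len + 1):
        for chars in product(alphabet, repeat=n):
            word = ''.join(chars)
            if bool(PAIR_RE.fullmatch(word)) != has_both(word):
                return False
    return True

print(pair_regex_agrees())
```

It should print True, since the two definitions agree on every word tried.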

Note: Although this is not the case with Python, I have observed that not all tools support lookahead and lookbehind, so I prefer to avoid them where possible.

You need to use the lookahead (?=...).

This one matches all words made up of [abc321] that contain at least one character from [321] and at least one from [abc].

>>> re.findall('\\b(?=[abc321]*[321])[abc321]*[abc][abc321]*\\b', '  123abc 123 abc')
['123abc']

This way you can AND several constraints on the same string.

>>> help(re) 
(?=...)  Matches if ... matches next, but doesn't consume the string.
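To illustrate the AND idea with more than two constraints, here is a rough sketch that puts one lookahead in front of the word for every required character class. `require_each` is a hypothetical helper, not part of `re`, and it assumes each entry is a plain character-class body such as 'a-c'.

```python
import re

def require_each(classes):
    """Build a regex for words made only of the given classes that contain
    at least one character from every class (one lookahead per class)."""
    allowed = '[' + ''.join(classes) + ']'
    lookaheads = ''.join('(?=%s*[%s])' % (allowed, c) for c in classes)
    return re.compile(r'\b' + lookaheads + allowed + r'+\b')

two = require_each(['a-c', '1-3'])
print(two.pattern)
print(two.findall('  123abc 123 abc'))

three = require_each(['a-c', '1-3', 'x-z'])
print(three.findall('1ax 1a ax x1 abc123xyz'))
```

With two classes this finds the same '123abc' as the hand-written pattern above. With three classes only '1ax' and 'abc123xyz' satisfy all of the lookaheads.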

Another way is to spell it out: a word with at least one of [abc] and one of [123] must contain a [123][abc] or an [abc][123] pair somewhere, which results in

>>> re.findall('\\b[abc321]*(?:[abc][123]|[123][abc])[abc321]*\\b', '  123abc 123 abc')
['123abc']

Not the prettiest regex, but it works:

\b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b

If the format is the same each time, that is:

[########: ########]: STARTED at ########ng job number ########

You can use:

([^\]\s]+)\]

Use it with re.findall, or with re.search and then .group(1) to get the captured text.

[^\]\s]+ is a negated class and will match any character except whitespace or a closing square bracket.

The regex basically looks for a run of characters (other than ] or whitespace) immediately followed by a closing square bracket.
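For example, applied to the string from the question it could look like this. It is a minimal sketch: `task_id_before_bracket` is a made-up helper name, and it only works when the id is directly followed by ].

```python
import re

# Capture the run of non-space, non-']' characters right before a ']'.
BRACKET_RE = re.compile(r'([^\]\s]+)\]')

def task_id_before_bracket(line):
    """Return the token just before the first ']', or None if there is none."""
    m = BRACKET_RE.search(line)
    return m.group(1) if m else None

s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
print(task_id_before_bracket(s))
print(BRACKET_RE.findall(s))

# The second sample from the question has no brackets, so nothing is found.
print(task_id_before_bracket('mytaskid 3fee46d2 STARTED processing job number 10022001'))
```

The last call returns None, which is why this pattern only helps when the bracketed format is guaranteed.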


If you want to match any string containing both alpha and numeric characters, you will need a lookahead:

\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b

Used like so:

result = re.search(r'\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b', text, re.I)

re.I makes the match case-insensitive.

\b is a word boundary and will match only between a 'word' character and a 'non-word' character (or at the start/end of the string).

(?=[0-9]*[a-z]) is a positive lookahead that makes sure there is at least one letter in the part to be matched.

(?=[a-z]*[0-9]) is a similar lookahead, but checks for a digit.
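Put together, the two lookaheads can be tried out like this. It is a sketch only: `find_mixed_tokens` is a name invented here, and `re.findall` is used instead of `re.search` to collect every matching word.

```python
import re

# Two lookaheads: one demands a letter, the other a digit, before the word
# itself is consumed. re.I lets [a-z] cover uppercase letters as well.
MIXED_RE = re.compile(r'\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b', re.I)

def find_mixed_tokens(text):
    """Return every word that contains at least one letter and one digit."""
    return MIXED_RE.findall(text)

print(find_mixed_tokens('[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'))
print(find_mixed_tokens('mytaskid 3fee46d2 STARTED processing job number 10022001'))
print(find_mixed_tokens('ID 3FEE46D2 done 42 a1'))
```

There is no {8} in the pattern, so short tokens such as 'a1' match as well as 8-character ids.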

You can use a more specific regular expression and skip the findall.

import re
s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
mo = re.search(r':\s+(\w+)', s)  # first word after a colon and whitespace
print(mo.group(1))
