
python3: parse a string (containing '*') using a regular expression

Let's say a string follows the pattern (\d+)(X|Y|Z)(!|#)?
That is: one or more digits, then one of X, Y or Z, then optionally ! or # (the trailing ! or # does not always appear).

I want to parse the string and return a list of the matched pieces.

ex1) str = 238Z!32Z#11234X
I want to return [238Z!, 32Z#, 11234X]

ex2) str = 91X92Y93Z
I want to return [91X, 92Y, 93Z]

Below is my code.

# your code goes here
import re

p=re.compile(r'^(\d+)(X|Y|Z)(!|#)?$')
L=p.findall("238Z!32Z!11234X")
print(L)

But I got an empty list [].
What is wrong with my code?

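Don't use ^ and $ in the regex. Without the re.MULTILINE flag, ^ matches only at the start of the string and $ only at the end (or just before a trailing newline). That means your regex can only match when the whole string is a single token, so findall finds nothing in a string that contains several of them.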

import re

p=re.compile(r'(\d+)(X|Y|Z)(!|#)?')
L=p.findall("238Z!32Z!11234X")
print(L)

Output:

[('238', 'Z', '!'), ('32', 'Z', '!'), ('11234', 'X', '')]

If you want the whole matched strings rather than tuples, don't use capturing groups:

p=re.compile(r'(?:\d+)(?:X|Y|Z)(?:!|#)?')

Output:

['238Z!', '32Z!', '11234X']
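
If the capturing groups are still wanted for other purposes, one alternative is to keep them and iterate with finditer, reading group(0) for the whole match. This is only a sketch; the variable names are illustrative and not part of the original code. Note that groups() reports a missing optional group as None, whereas findall reports it as an empty string.

```python
import re

# Capturing groups are kept on purpose; group(0) is always the whole match.
pattern = re.compile(r'(\d+)(X|Y|Z)(!|#)?')
text = "238Z!32Z!11234X"

whole = [m.group(0) for m in pattern.finditer(text)]
parts = [m.groups() for m in pattern.finditer(text)]

print(whole)  # ['238Z!', '32Z!', '11234X']
print(parts)  # [('238', 'Z', '!'), ('32', 'Z', '!'), ('11234', 'X', None)]
```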

First, ^ and $ are metacharacters that match the start and end of the string being searched (not of the pattern). You have to remove them so that the regex can find every occurrence of the pattern.

Second, findall returns the groups instead of the whole match if the pattern contains at least one capturing group: a list of strings for a single group, a list of tuples for several. Capturing groups are defined by the parentheses in the pattern. Use non-capturing groups (?:...) instead.

import re

p = re.compile(r'(?:\d+)(?:X|Y|Z)(?:!|#)?')
L = p.findall("238Z!32Z!11234X")
print(L)
# ['238Z!', '32Z!', '11234X']

Another piece of advice when writing a regex: to match one character out of a set, you do not need (a|b|c). The character class [abc] matches the same thing without creating a group.

Moreover, you do not need parentheses to quantify a single element. (\d+) matches the same text as \d+, and without the parentheses you no longer have any group problem.

Your regex would then become:

\d+[XYZ][!#]?
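
As a quick check, the simplified pattern can be run against the two example strings from the question and compared with the non-capturing-group version. This is a sketch; the names token and grouped are made up for the illustration.

```python
import re

token = re.compile(r'\d+[XYZ][!#]?')                 # simplified form
grouped = re.compile(r'(?:\d+)(?:X|Y|Z)(?:!|#)?')    # non-capturing groups

for text in ("238Z!32Z#11234X", "91X92Y93Z"):
    # Both patterns should produce the same list of whole matches.
    assert token.findall(text) == grouped.findall(text)
    print(token.findall(text))
# ['238Z!', '32Z#', '11234X']
# ['91X', '92Y', '93Z']
```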

You should not use the ^ and $ anchors, as they require the whole string to match a single occurrence of the pattern.

Also, don't use capturing groups if you want findall to return the whole matches:

p=re.compile(r'\d+[XYZ][!#]?')

['238Z!', '32Z!', '11234X']
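
To see the effect of the anchors, the anchored pattern can be tried on a single token and on the full string. The parse_tokens helper below is hypothetical and goes beyond what the question asks: it assumes you also want to reject input that is not made up entirely of tokens, which findall alone would silently skip over.

```python
import re

anchored = re.compile(r'^(\d+)(X|Y|Z)(!|#)?$')
print(anchored.findall("238Z!"))            # [('238', 'Z', '!')]
print(anchored.findall("238Z!32Z!11234X"))  # []

# Hypothetical helper: validate the whole string first, then split it.
def parse_tokens(text):
    # fullmatch succeeds only if the entire string is one or more tokens.
    if not re.fullmatch(r'(?:\d+[XYZ][!#]?)+', text):
        raise ValueError(f"unexpected input: {text!r}")
    return re.findall(r'\d+[XYZ][!#]?', text)

print(parse_tokens("238Z!32Z!11234X"))      # ['238Z!', '32Z!', '11234X']
```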
