
Regex python findall issue

From the test string:

 test=text-AB123-12a
 test=text-AB123a

I have to extract only 'AB123-12' and 'AB123', but:

 re.findall("[A-Z]{0,9}\d{0,5}(?:-\d{0,2}a)?", test)

returns:

['', '', '', '', '', '', '', 'AB123-12a', '']

What are all these extra empty strings? How do I remove them?

The quantifier {0,n} matches anywhere from 0 to n occurrences of the preceding pattern. Since the first two parts of your pattern allow 0 occurrences and the third is optional ( ? ), the whole pattern can match a zero-length string, so you get an empty match at every position in your string where nothing longer matches.
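
To see where the empty strings come from, here is a minimal sketch that lists every match together with its position using re.finditer. The names `pattern` and `spans` are illustrative and not from the question.

```python
import re

test = "text-AB123-12a"
# The pattern from the question, written as a raw string.
pattern = re.compile(r"[A-Z]{0,9}\d{0,5}(?:-\d{0,2}a)?")

# Collect (start, end, text) for every match, including zero-length ones.
spans = [(m.start(), m.end(), m.group()) for m in pattern.finditer(test)]
for start, end, text in spans:
    print(start, end, repr(text))
```

Every entry other than 'AB123-12a' has start equal to end, i.e. it is a zero-length match at a position where none of the optional parts could consume a character.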

Changing the first two quantifiers to require at least one occurrence (1 to 9 letters, 1 to 5 digits) yields the correct result:

>>> test='text-AB123-12a'
>>> import re
>>> re.findall(r"[A-Z]{1,9}\d{1,5}(?:-\d{0,2}a)?", test)
['AB123-12a']

Without further detail about what exactly the strings you are matching look like, I can't give a better answer.

Your pattern can match zero-length strings because the lower limits of the quantifiers on your character sets are 0. Simply setting them to 1 will produce the results you want:

>>> import re
>>> test = ''' test=text-AB123-12a
...  test=text-AB123a'''
>>> re.findall(r"[A-Z]{1,9}\d{1,5}(?:-\d{0,2}a)?", test)
['AB123-12a', 'AB123']

RegEx tester: http://www.regexpal.com/ says that your pattern string [A-Z]{0,9}\d{0,5}(?:-\d{0,2}a)? can match 0 characters, and therefore matches infinitely.

Check your expression one more time. Python is not misbehaving here: re.findall includes empty matches in its result.

Since all parts of your pattern are optional (your ranges specify zero to N occurrences and you are qualifying the group with ? ), each position in the string counts as a match, and most of those are empty matches.

How to prevent this from happening depends on the exact format of what you are trying to match. Are all those parts of your match really optional?
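
If every part really is optional and the pattern has to stay as it is, one fallback is to discard the empty results after the fact. This is only a sketch under that assumption; `matches` is an illustrative name, and the trailing "a" is still part of each match, as in the original pattern.

```python
import re

test = """ test=text-AB123-12a
 test=text-AB123a"""

# Original pattern, unchanged: every part may match zero characters.
pattern = r"[A-Z]{0,9}\d{0,5}(?:-\d{0,2}a)?"

# Keep only the non-empty matches.
matches = [m for m in re.findall(pattern, test) if m]
print(matches)  # ['AB123-12a', 'AB123']
```

Note that with all parts optional, a lone fragment such as "-1a" elsewhere in the input would also count as a non-empty match, so filtering alone does not make the pattern stricter.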

Since both the letters and the digits are optional at the beginning, you must ensure that there is at least one letter or one digit; otherwise your pattern will match the empty string at each position in the string. You can do this by starting your pattern with a lookahead. Example:

re.findall(r'(?=[A-Z0-9])[A-Z]{0,9}\d{0,5}(?:-\d\d?)?(?=a)', test)

In this way the match can start with a letter or with a digit.

I assume that when there's a hyphen, it is followed by at least one digit (otherwise, what is the purpose of the hyphen?). In other words, I assume that -a isn't possible at the end. (Correct me if I'm wrong.)

To exclude the "a" from the match result, I put it in a lookahead.
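
As a quick check, the lookahead pattern can be run against both test lines from the question. This is a sketch assuming the two lines are held in one multi-line string; `result` is an illustrative name.

```python
import re

test = """ test=text-AB123-12a
 test=text-AB123a"""

# (?=[A-Z0-9]) requires a letter or digit at the start, so no empty matches.
# (?=a) checks for the trailing "a" without including it in the result.
pattern = r"(?=[A-Z0-9])[A-Z]{0,9}\d{0,5}(?:-\d\d?)?(?=a)"

result = re.findall(pattern, test)
print(result)  # ['AB123-12', 'AB123']
```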
