Python seems to incorrectly identify case-sensitive string using regex

I'm checking for a case-sensitive string pattern using Python 2.7 and it seems to return an incorrect match. I've run the following tests:

>>> import re
>>> rex_str = "^((BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?.(?i)pdf$)"
>>> not re.match(rex_str, 'BOA_1988-148.pdf')
False
>>> not re.match(rex_str, 'BOA_1988-148.PDF')
False
>>> not re.match(rex_str, 'BOA1988-148.pdf')
True
>>> not re.match(rex_str, 'boa_1988-148.pdf')
False

The first three tests are correct, but the final test, 'boa_1988-148.pdf', should return True because the pattern is supposed to treat the first three characters (BOA) as case-sensitive.

I checked the expression with an online tester (https://regex101.com/) and the pattern behaved as intended there, flagging the final string as a non-match because 'boa' was lower case. Am I missing something, or do you have to explicitly declare a group as case-sensitive using a mode like (?c)?

Flags do not apply to portions of a regex. You told the regex engine to match case insensitively:

(?i)

From the syntax documentation:

 (?aiLmsux) 

(One or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), and re.X (verbose), for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function. Flags should be used first in the expression string.

Emphasis mine: the flag applies to the whole pattern, not just a substring. If you need to match just pdf or PDF, use that in your pattern directly:

r"^((BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.(?:pdf|PDF)$)"

This matches either .pdf or .PDF. If you need to match any mix of uppercase and lowercase, use:

r"^((BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.[pP][dD][fF]$)"
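As a quick check of the two patterns above, here is a minimal sketch that runs both against the question's filenames plus a mixed-case one. It assumes the dot before the extension is meant literally, so it is escaped. The third variant, a scoped flag group (?i:...), is not available in the question's Python 2.7; it is included on the assumption that some readers are on Python 3.6 or later, where that syntax exists. The names PREFIX, exact_case, any_case and scoped are made up for this illustration.

```python
import re

# Shared prefix; the dot is escaped so only a literal "." is accepted.
PREFIX = r"^((BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\."

exact_case = re.compile(PREFIX + r"(?:pdf|PDF)$)")   # only "pdf" or "PDF"
any_case = re.compile(PREFIX + r"[pP][dD][fF]$)")    # any mix of case
# Python 3.6+ only: the flag is limited to the group that carries it.
scoped = re.compile(PREFIX + r"(?i:pdf)$)")

names = [
    "BOA_1988-148.pdf",
    "BOA_1988-148.PDF",
    "BOA_1988-148.Pdf",
    "BOA1988-148.pdf",
    "boa_1988-148.pdf",
]

# Map each filename to (exact_case, any_case, scoped) match results.
results = {
    name: (
        bool(exact_case.match(name)),
        bool(any_case.match(name)),
        bool(scoped.match(name)),
    )
    for name in names
}

for name, outcome in results.items():
    print(name, outcome)
```

In all three variants the BOA_ prefix stays case-sensitive, so 'boa_1988-148.pdf' is rejected; they differ only in whether a mixed-case extension such as .Pdf is accepted.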

(?i) is not limited to the text that follows it, or to the group that contains it. From the Python 2 re documentation:

(?iLmsux)

(One or more letters from the set 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags […] for the entire regular expression.
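To make the quoted behaviour concrete, here is a small sketch of what "the entire regular expression" means in practice. It places (?i) at the very start, which is where the documentation says flags should go, because Python 3.11 and later reject a global flag written in the middle of a pattern. In the question's Python 2.7 the mid-pattern flag is accepted and, per the documentation above, has the same effect as shown here. The variable names are illustrative only.

```python
import re

# Inline global flag at the start of the pattern.
inline = re.compile(r"(?i)^(BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.pdf$")

# The same thing expressed with a flag argument instead of an inline flag.
with_arg = re.compile(r"^(BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.pdf$", re.IGNORECASE)

lower_prefix = "boa_1988-148.pdf"

# Both report a match: the literal "BOA_" is compared case-insensitively too,
# not just the "pdf" extension.
print(bool(inline.match(lower_prefix)))
print(bool(with_arg.match(lower_prefix)))
```

This is why the final test in the question reports a match: once the flag is set, nothing in the pattern is compared case-sensitively.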

One option is to do it manually:

r"^(BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.[Pp][Dd][Ff]\Z"

Another is to use a separate case-sensitive check:

 rex_str = r"(?i)^(BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.pdf\Z"
 match = re.match(rex_str, s) if s.startswith("BOA_") else None

or separate case-insensitive one:

 rex_str = r"^(BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\..{3}\Z"
 match = re.match(rex_str, s) if s.lower().endswith(".pdf") else None
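The two split checks above can be wrapped as small helpers for easier reuse and testing. This is only a sketch: the function and constant names are invented here, and s is assumed to be a plain filename string.

```python
import re

# Case-insensitive regex; the "BOA_" prefix is verified separately with startswith().
_ANY_CASE = re.compile(r"(?i)^(BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.pdf\Z")

# Case-sensitive regex accepting any 3-character extension;
# the extension is verified separately with lower().endswith().
_ANY_EXT = re.compile(r"^(BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\..{3}\Z")


def match_prefix_checked(s):
    """Return a match object if s starts with an exact-case 'BOA_', else None."""
    return _ANY_CASE.match(s) if s.startswith("BOA_") else None


def match_suffix_checked(s):
    """Return a match object if s ends with '.pdf' in any case, else None."""
    return _ANY_EXT.match(s) if s.lower().endswith(".pdf") else None


for name in ["BOA_1988-148.PDF", "boa_1988-148.pdf", "BOA_1988-148co.pdf"]:
    print(name, bool(match_prefix_checked(name)), bool(match_suffix_checked(name)))
```

One difference worth keeping in mind: in the first helper the optional CO part is also matched case-insensitively, so 'BOA_1988-148co.pdf' is accepted there but rejected by the second helper, where only the extension is case-insensitive.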
