In the code below:
>>> import re
>>> pattern = re.compile(r'^<HTML>')
>>> pattern.match("<HTML>")
<_sre.SRE_Match at 0x1043bc8b8>
>>> pattern.match("⇢ ⇢ <HTML>", 2) # ⇢ stands for whitespace character.
None
When the pattern uses the ^ metacharacter, whitespace at the beginning of the string (as above) prevents a match even when the 'pos' argument is 2. The reason given was that ^ cannot be matched in such cases: < is at position 2, and ^ does not match there.
>>> pattern = re.compile(r'<HTML>$')
>>> pattern.match("<HTML>⇢", 0,6) # ⇢ stands for whitespace character.
<_sre.SRE_Match object at 0x1007033d8>
>>> pattern.match("<HTML>⇢"[:6])
<_sre.SRE_Match object at 0x100703370>
But when the regular expression ends with $ and the 'endpos' argument is given, there is a match. Why the difference?
You have to dig a little into the docs, but the answer is there. The docs for pattern.search say the following, and the same description applies to pattern.match as well:
The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.
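A minimal sketch of that difference, assuming the pattern from the question and a string with two ordinary leading spaces (the variable names are illustrative only):

```python
import re

pattern = re.compile(r'^<HTML>')
text = "  <HTML>"  # two leading spaces, so '<' sits at index 2

# pos moves the index where matching starts, but '^' still anchors to
# the real beginning of the string (index 0), so this finds nothing.
with_pos = pattern.match(text, 2)

# Slicing creates a new string whose real beginning is '<'.
with_slice = pattern.match(text[2:])

# In MULTILINE mode '^' also matches just after a newline, so pos can
# point at a position where '^' succeeds.
multiline = re.compile(r'^<HTML>', re.MULTILINE)
after_newline = multiline.match("\n<HTML>", 1)

print(with_pos)                    # None
print(with_slice is not None)      # True
print(after_newline is not None)   # True
```

Passing pos is therefore not a drop-in replacement for slicing whenever the pattern contains ^.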
So the start-of-line anchor ^ matches at the true beginning of the string, not at the position given by pos. On the other hand:
The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match.
This means a pattern with the end-of-line anchor $ does match at endpos, because the string is treated as if it ended there (unlike pos, which does not move the beginning of the string).
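A short sketch of the endpos side, assuming the second pattern from the question and a string with one ordinary trailing space (the variable names are illustrative only):

```python
import re

pattern = re.compile(r'<HTML>$')
text = "<HTML> "  # one trailing space, so the string has length 7

# Without endpos, '$' must match at the real end of the string, and the
# trailing space gets in the way.
no_limit = pattern.match(text)

# With endpos=6 the string is treated as if it were 6 characters long,
# so '$' matches right after '>'.
with_endpos = pattern.match(text, 0, 6)

# For the end of the string, slicing gives the same result.
with_slice = pattern.match(text[:6])

print(no_limit)                   # None
print(with_endpos is not None)    # True
print(with_slice is not None)     # True
```

So endpos behaves like slicing off the tail of the string, while pos does not behave like slicing off the head.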