$ matches at the end of a line, which is defined as either the end of the string or any location followed by a newline character. However, the Windows newline sequence consists of two characters, '\r\n'. How can I make '$' recognize '\r\n' as a newline in a bytes pattern?
Here is what I have:
# Python 3.4.2
import re
input = b'''
//today is a good day \r\n
//this is Windows newline style \r\n
//unix line style \n
...other binary data...
'''
L = re.findall(rb'//.*?$', input, flags=re.DOTALL | re.MULTILINE)
for item in L:
    print(item)
Now the output is:
b'//today is a good day \r'
b'//this is Windows newline style \r'
b'//unix line style '
But the expected output is:
b'//today is a good day '
b'//this is Windows newline style '
b'//unix line style '
It is not possible to redefine anchor behavior. To match // followed by any number of characters other than CR and LF, use a negated character class, [^\r\n], with the * quantifier:
L = re.findall(rb'//[^\r\n]*', input)
Note that this approach does not require the re.M and re.S flags.
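As a quick check of the negated character class, here is a minimal sketch run against sample data modeled on the question. The variable names data and comments and the exact sample bytes are assumptions made for illustration.

```python
import re

# Sample bytes modeled on the question; `data` is an illustrative name.
data = (
    b'//today is a good day \r\n'
    b'//this is Windows newline style \r\n'
    b'//unix line style \n'
)

# [^\r\n]* stops before the first CR or LF, so no flags are needed.
comments = re.findall(rb'//[^\r\n]*', data)
for item in comments:
    print(item)
```

Because the class excludes both CR and LF, the trailing \r never ends up in a match, whichever line-ending style the input uses.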
Alternatively, you can add \r? before the $ and enclose both in a positive lookahead (you will also need the lazy quantifier *? with ., and the re.M flag so that $ matches at every line end):
rb'//.*?(?=\r?$)'
The point of using a lookahead is that $ is itself a kind of lookahead, since it does not actually consume the \n character. Thus, we can safely put it into a lookahead together with an optional \r.
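The lookahead variant can be sketched as follows. The sample bytes and the names data, pattern and matches are assumptions for illustration; re.M is passed so that $ matches before every \n, not only at the end of the input.

```python
import re

# Illustrative input mixing CRLF, LF and an unterminated last line.
data = b'//first \r\n//second \n//last'

# re.M lets $ match before each \n; the lazy .*? stops at the first
# position where an optional \r is followed by a line end.
pattern = re.compile(rb'//.*?(?=\r?$)', re.M)
matches = pattern.findall(data)
print(matches)
```

Since the \r sits inside the lookahead, it is checked but never becomes part of the match.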
Maybe this is not that pertinent since it is from MSDN, but I think it is the same for Python:
Note that $ matches \n but does not match \r\n (the combination of carriage return and newline characters, or CR/LF). To match the CR/LF character combination, include \r?$ in the regular expression pattern.
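Assuming the MSDN advice carries over to Python's re module with the re.M flag, a minimal sketch of \r?$ might look like this. The sample bytes and the names data and lines are made up for illustration.

```python
import re

# Illustrative input with one CRLF line, one LF line, and a final
# line without a terminator.
data = b'one\r\ntwo\nthree'

# The group captures the line content; \r?$ then absorbs an optional
# CR sitting directly before the line end.
lines = re.findall(rb'^(.*?)\r?$', data, re.M)
print(lines)
```

Here the \r is consumed by the pattern but kept outside the capture group, so the captured lines come back without it.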
In PCRE, you can use (*ANYCRLF), (*CR) and (*ANY) to override the default behavior of the $ anchor, but not in Python.
A hack, but...
re.findall(rb'//.*?(?=\r|\n|(?!.))', input, re.DOTALL | re.MULTILINE)
This should replicate the behaviour of the default $ anchor, extended to stop just before \r, \n or the end of the string.
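A minimal sketch of this approach against bytes input follows. A bytes pattern (the rb prefix) is needed because the subject is bytes; the sample data and names are assumptions for illustration.

```python
import re

# Illustrative input: CRLF line, LF line, and an unterminated last line.
data = b'//first \r\n//second \n//last'

# With re.DOTALL, . matches every byte, so (?!.) can only succeed at
# the very end of the input. The lazy .*? therefore stops before the
# first \r, the first \n, or the end of the data.
matches = re.findall(rb'//.*?(?=\r|\n|(?!.))', data, re.DOTALL)
print(matches)
```

Note that a lone \r also ends a match here, which differs from how $ behaves on its own.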
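In flavors such as PCRE, you could also use \v, the vertical whitespace class, which matches [\n\cK\f\r\x85\x{2028}\x{2029}]. To keep it out of the output, use a lookahead: //.*(?=\v|$). Note that Python's re module treats \v as the vertical tab character only, so this pattern does not carry over to Python unchanged.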
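In Python's re module, \v denotes only the vertical tab (0x0b), not a whole class of vertical whitespace. The sketch below checks that and uses an explicit character class as a stand-in. The VERTICAL class (limited here to single-byte ASCII terminators) and the sample data are assumptions for illustration, not an equivalent of the PCRE class.

```python
import re

# \v in Python's re is the literal vertical tab, nothing more.
vt_matches_vtab = re.fullmatch(rb'\v', b'\x0b') is not None
vt_matches_newline = re.fullmatch(rb'\v', b'\n') is not None

# Hypothetical stand-in for a vertical-whitespace class on bytes:
# LF, VT, FF and CR only.
VERTICAL = rb'[\n\x0b\f\r]'

data = b'//first \r\n//second \x0b//last'

# Stop before any byte in VERTICAL, or at the end of the input (\Z).
pattern = re.compile(rb'//.*?(?=' + VERTICAL + rb'|\Z)', re.DOTALL)
matches = pattern.findall(data)
print(vt_matches_vtab, vt_matches_newline, matches)
```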