
$ Windows newline symbol in Python bytes regex

$ matches at the end of a line, which is defined as either the end of the string, or any location followed by a newline character.

However, a Windows newline consists of two characters, '\r\n'. How can I make '$' treat '\r\n' as a newline in a bytes pattern?

Here is what I have:

# Python 3.4.2
import re

input = b'''
//today is a good day \r\n
//this is Windows newline style \r\n
//unix line style \n
...other binary data... 
'''

L = re.findall(rb'//.*?$', input, flags = re.DOTALL | re.MULTILINE)
for item in L : print(item)

The current output is:

b'//today is a good day \r'
b'//this is Windows newline style \r'
b'//unix line style '

but the expected output is:
b'//today is a good day '
b'//this is Windows newline style '
b'//unix line style '

It is not possible to redefine anchor behavior.

To match // followed by any number of characters other than CR and LF, use a negated character class [^\r\n] with the * quantifier:

L = re.findall(rb'//[^\r\n]*', input)

Note that this approach does not require using re.M and re.S flags.
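A minimal sketch of the negated-class approach, assuming a hypothetical sample named `data` that mirrors the question's input (the name `data` is used here instead of `input` only to avoid shadowing the built-in):

```python
import re

# Hypothetical sample mixing Windows (\r\n) and Unix (\n) line endings.
data = (
    b'//today is a good day \r\n'
    b'//this is Windows newline style \r\n'
    b'//unix line style \n'
    b'...other binary data...'
)

# [^\r\n]* stops before the first CR or LF, so no flags are needed.
comments = re.findall(rb'//[^\r\n]*', data)
for item in comments:
    print(item)
```

Because the character class excludes both CR and LF, the trailing '\r' never ends up in the match, whichever line-ending style is used.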

Or you can add \r? before the $ and wrap that part in a positive lookahead (you will also need a lazy *? quantifier with . ):

rb'//.*?(?=\r?$)'

The point of using a lookahead is that $ is itself a kind of lookahead, since it does not actually consume the \n character. Thus, we can safely put it inside a lookahead together with an optional \r .
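As a rough illustration of the lookahead variant, here is a sketch that assumes the same flags as the question (re.DOTALL | re.MULTILINE) and a hypothetical `data` sample:

```python
import re

# Hypothetical sample: CRLF, LF, and an unterminated last line.
data = b'//windows style \r\n//unix style \n//last line'

# The lazy .*? grows one byte at a time until the lookahead succeeds:
# an optional CR followed by an end-of-line position.
# re.MULTILINE lets $ match before every \n, not just at the end.
matches = re.findall(rb'//.*?(?=\r?$)', data, flags=re.DOTALL | re.MULTILINE)
for item in matches:
    print(item)
```

Neither the CR nor the LF is consumed, so the matches contain only the comment text.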

Maybe this is not that pertinent since it comes from MSDN, but I think the same applies to Python:

Note that $ matches \n but does not match \r\n (the combination of carriage return and newline characters, or CR/LF). To match the CR/LF character combination, include \r?$ in the regular expression pattern.

In PCRE, you can use (*ANYCRLF), (*CR) and (*ANY) to override the default behavior of the $ anchor, but not in Python.

A hack, but...

re.findall(rb'//.*?(?=\r|\n|(?!.))', input, re.DOTALL | re.MULTILINE)

This should replicate the behaviour of the default $ anchor, extended to CR: the match ends just before \r , \n or the end of the string.
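To see how this differs from the \r?$ lookahead, here is a small comparison on a hypothetical sample that includes a lone CR (a bytes pattern, rb'...', is used because the data is bytes):

```python
import re

FLAGS = re.DOTALL | re.MULTILINE

# Hypothetical sample: CRLF, a lone CR, and an unterminated last line.
data = b'//windows \r\n//lone cr \rtail\n//last'

# Stops before any CR, any LF, or the end of the string.
hack = re.findall(rb'//.*?(?=\r|\n|(?!.))', data, FLAGS)

# Stops only before an end-of-line position, optionally preceded by CR.
optional_cr = re.findall(rb'//.*?(?=\r?$)', data, FLAGS)

print(hack)
print(optional_cr)
```

On this sample the first pattern ends the second match at the lone CR, while the \r?$ version runs on to the next LF, so the choice depends on whether a bare CR should count as a line ending.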

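I think you could also use \v (vertical whitespace), which in PCRE matches [\n\cK\f\r\x85\x{2028}\x{2029}] . Note that in Python's re module \v matches only the vertical tab character, so this does not carry over directly.

To keep it out of the output, use a lookahead: //.*(?=\v|$)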
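For Python specifically, a quick check suggests that \v is just the vertical tab (0x0B) in the re module, so an explicit byte class is one possible stand-in. The sketch below assumes a hypothetical `VERTICAL` class restricted to the ASCII members of the set above, since the non-ASCII ones are not single bytes in UTF-8:

```python
import re

# In Python's re, \v is only the vertical tab, not a newline class.
matches_vtab = re.search(rb'\v', b'\x0b') is not None
matches_crlf = re.search(rb'\v', b'\r\n') is not None
print(matches_vtab, matches_crlf)

# Hypothetical stand-in for "vertical whitespace" in a bytes pattern:
# LF, vertical tab, form feed, CR.
VERTICAL = rb'[\n\x0b\x0c\r]'

data = b'//windows \r\n//unix \n//last'

# \Z covers a final line that has no terminator.
found = re.findall(rb'//.*?(?=' + VERTICAL + rb'|\Z)', data)
print(found)
```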

Test at regex101.com
