Python regex negative lookbehind assertion: are the start and end indices of a subsequent captured group incorrect?

This has me scratching my head.

The following code illustrates the problem in Python 2.7:

import re

test_string = "wqdwc ww w w\nwcw w wef wefwe fq\nWrite {\ndfwdfwdc\ndfdfwdf wef we\nwefwe wefwe ewf\nwefdww {{ wefwe, wefwe, wef, } \n}\nwqef wefwef qw\n}\nwef qw qfg3g q g\ng332r256\n32e5\n"

# Raw string so that \w, \d, \s and \{ reach the regex engine unchanged.
node_descr_re = re.compile(r'\n([A-Z]\w+\d*)\s\{.+?(?<! )\n(\})', re.DOTALL)

node_descr_match = node_descr_re.search( test_string )

node_block = node_descr_match.group()

print("---------------------------")
print("Matched string: \n{}".format( node_block ))
print("---------------------------")
print("Node block length: {}".format( len(node_block) ))
print("group(2) start index: {}".format( node_descr_match.start(2) ))
print("group(2) end index: {}".format( node_descr_match.end(2) ))
print("group(2) capture: {}".format( node_descr_match.group(2) ))

I run this and get:

---------------------------
Matched string: 

Write {
dfwdfwdc
dfdfwdf wef we
wefwe wefwe ewf
wefdww {{ wefwe, wefwe, wef, } 
}
wqef wefwef qw
}
---------------------------
Node block length: 99
group(2) start index: 129
group(2) end index: 130
group(2) capture: }

The regex correctly matches the node description in amongst the gibberish, and correctly applies the negative lookbehind assertion: it ignores the \n and closing curly brace that have a space before them, and seizes upon the subsequent \n and closing curly brace that do not.
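To isolate what the lookbehind contributes, here is a minimal sketch on a shorter, made-up string. The names sample, with_lb and without_lb are illustrative only and are not part of the code above; the patterns are simplified versions of the real one.

```python
import re

# Hypothetical sample: the first "\n}" is preceded by a space, the second is not.
sample = "\nNode {\ninner } \n}\nmore\n}\nrest"

# Same idea as the real pattern, with and without the negative lookbehind.
with_lb = re.compile(r'\n\w+\s\{.+?(?<! )\n\}', re.DOTALL)
without_lb = re.compile(r'\n\w+\s\{.+?\n\}', re.DOTALL)

short_match = without_lb.search(sample).group()
full_match = with_lb.search(sample).group()

print(repr(short_match))  # stops at the first "\n}", the one after the space
print(repr(full_match))   # skips it and stops at the second "\n}"
```

Without the lookbehind, the lazy .+? stops at the first newline-plus-brace; with it, that candidate is rejected because the character before the newline is a space.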

The output of the print statements should illustrate what is troubling me. It shows that group(2) has successfully captured the closing curly brace of the node description. But although the entire node description match (node_block) is only 99 characters long, the start index of its final character (the closing curly brace captured by group(2)) is reported as 129?

Can anyone shed some light?

EDIT: is the index referring to the position of the match in the original test_string? That seems to be the answer. Sorry for the question. I sometimes forget that the start and end indices of groups refer to positions within the original string rather than within the match itself.

>>> test_string[129:]
'}\nwef qw qfg3g q g\ng332r256\n32e5\n'

That shows the closing curly brace does in fact appear at index 129 = node_descr_match.start(2) . Note that node_block itself doesn't start at index 0 either. It begins at 31:

>>> test_string.index("\nWrite")
31

node_block is the entire match, so spans test_string[31 : 31+99] = test_string[31 : 130] . The closing curly brace is its last character, so from this view too its index must be 130-1 = 129 .

There's no inconsistency here I can see.
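If an offset relative to the matched block is what you actually want, one way is to subtract the start of the whole match. The following is only a sketch using the question's own string and pattern; abs_start and rel_start are names I made up for illustration.

```python
import re

test_string = "wqdwc ww w w\nwcw w wef wefwe fq\nWrite {\ndfwdfwdc\ndfdfwdf wef we\nwefwe wefwe ewf\nwefdww {{ wefwe, wefwe, wef, } \n}\nwqef wefwef qw\n}\nwef qw qfg3g q g\ng332r256\n32e5\n"

node_descr_re = re.compile(r'\n([A-Z]\w+\d*)\s\{.+?(?<! )\n(\})', re.DOTALL)
m = node_descr_re.search(test_string)
node_block = m.group()

# start()/end() always index into the string that was searched ...
abs_start = m.start(2)
# ... so subtract the start of the whole match to index into node_block.
rel_start = m.start(2) - m.start()

print("match span in test_string: {}".format(m.span()))
print("group(2) start in test_string: {}".format(abs_start))
print("group(2) start in node_block: {}".format(rel_start))

# Both offsets point at the same closing brace.
assert test_string[abs_start] == node_block[rel_start] == "}"
```

Here rel_start comes out as len(node_block) - 1, which is what the question expected to see in the first place.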
