Regex won't separate the last string

I wrote a regex that should extract specific sequences of numbers from an HTML file, but it doesn't work on the last part. This is what the HTML file prints out:

0430\n
0500 20 40 53\n
0606 19 32 45 58\n
0711 22 33 44 55 \n
...
2000 20 40\n
2100 20 40\n
2200 20 40\n
2300 20 40\n
0000\n
\n

and this is my regex:

timeRegex = re.compile(r'''((\d\d)(\d\d)
(\n|(\s
(\d\d)
\s?
(\d\d)?
\s?
(\d\d)?
\s?
(\d\d)?
\s?
(\d\d)?
)\n)?
)''',re.VERBOSE|re.DOTALL)

When I look at the resulting list, it works fine for the most part, except for the last element, which also picks up the 0000 and looks like this: '2300 20 40\n0000\n\n'. Please help.

When it gets to this part of the input:

2300 20 40\n
0000\n

It matches as follows:

  • (\d\d)(\d\d) matches 2300
  • \s matches the space
  • (\d\d) matches 20
  • \s? matches the space
  • (\d\d)? matches 40
  • \s? matches the newline
  • (\d\d)? matches 00
  • \s? matches nothing, since it's optional
  • (\d\d)? matches 00
  • \s? matches the newline after 0000, and (\d\d)? matches nothing, since it's optional
  • \n matches the final newline (the blank line)
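To see this for yourself, here is a small sketch that runs the pattern from the question over the tail of the input. The sample string is an assumption built from the lines shown in the question, not the real file.

```python
import re

# The pattern from the question, unchanged.
time_regex = re.compile(r'''((\d\d)(\d\d)
(\n|(\s
(\d\d)
\s?
(\d\d)?
\s?
(\d\d)?
\s?
(\d\d)?
\s?
(\d\d)?
)\n)?
)''', re.VERBOSE | re.DOTALL)

# Assumed sample: the last few lines of the input, including the blank line.
sample = "2200 20 40\n2300 20 40\n0000\n\n"

# Collect the full text of every match.
matches = [m.group(0) for m in time_regex.finditer(sample)]
print(matches)
# ['2200 20 40\n', '2300 20 40\n0000\n\n']
```

The 2200 line comes out on its own, while the 2300 line swallows 0000 and the blank line.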

I suspect you didn't realize that \s matches any kind of whitespace, including newlines. To match a literal space in a verbose regexp, write a space preceded by a backslash. So most of those \s? should be \ ? (backslash, space, question mark).
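A minimal sketch of that fix, assuming the same overall pattern shape as in the question. It writes the literal space as [ ] rather than backslash-space: whitespace inside a character class is also kept in verbose mode, and it does not get lost if an editor trims trailing spaces. The names fixed_regex and sample are made up for illustration.

```python
import re

# Same structure as the question's pattern, but every whitespace token is now
# a literal space ([ ]), so it can no longer consume a newline.
fixed_regex = re.compile(r'''((\d\d)(\d\d)
(\n|([ ]
(\d\d)
[ ]?
(\d\d)?
[ ]?
(\d\d)?
[ ]?
(\d\d)?
[ ]?
(\d\d)?
)\n)?
)''', re.VERBOSE)

# Assumed sample, including a line with a trailing space and the final blank line.
sample = "0711 22 33 44 55 \n2300 20 40\n0000\n\n"

matches = [m.group(0) for m in fixed_regex.finditer(sample)]
print(matches)
# ['0711 22 33 44 55 \n', '2300 20 40\n', '0000\n']
```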

The reason is twofold:

  1. \s matches all whitespace, newlines as well as spaces;
  2. as @WiktorStribiżew has already said, \s? matches zero whitespace characters, too.
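Both points can be checked in isolation with a quick sketch (the variable names are only for illustration):

```python
import re

# Point 1: \s is a whitespace class, so it accepts a newline as well as a space.
newline_ok = re.fullmatch(r'\s', '\n') is not None
space_ok = re.fullmatch(r'\s', ' ') is not None

# Point 2: \s? may match zero characters, so two digit pairs can sit side by side.
adjacent_ok = re.fullmatch(r'(\d\d)\s?(\d\d)', '0000') is not None

print(newline_ok, space_ok, adjacent_ok)
# True True True
```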

So what happens is that one of your \s? eats the newline after the line 2300 20 40, and the next \s? matches the absent whitespace (zero characters) in the middle of 0000. You don't see the problem in other places because the pattern has one \s?(\d\d)? too few to cover two full lines; add one more to the regex and you will see the lines

2000 20 40\n
2100 20 40\n

merged into a single match too.
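Here is a rough way to try that out. The helper build_pattern is hypothetical: it rebuilds the question's pattern in non-verbose form with a configurable number of optional \s?(\d\d)? pairs (the question has four).

```python
import re

def build_pattern(optional_pairs):
    # Hypothetical helper: the question's pattern, written on one line,
    # with `optional_pairs` copies of \s?(\d\d)? after the mandatory field.
    return re.compile(
        r'(\d\d)(\d\d)(\n|(\s(\d\d)' + r'\s?(\d\d)?' * optional_pairs + r')\n)?'
    )

sample = "2000 20 40\n2100 20 40\n"

# Four optional pairs, as in the question: the two lines stay separate.
with_four = [m.group(0) for m in build_pattern(4).finditer(sample)]
# Five optional pairs: enough fields to span both lines, so they merge.
with_five = [m.group(0) for m in build_pattern(5).finditer(sample)]

print(with_four)  # ['2000 20 40\n', '2100 20 40\n']
print(with_five)  # ['2000 20 40\n2100 20 40\n']
```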

I am not sure how you want to parse this file but, judging from your code, it is line by line. If so, "explicit is better than implicit":

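import re

# One group around the whole repetition, so group(2) holds every field, not just the last.
time_regex = re.compile(r'^(\d{4})((?:\s\d{2})*)\s*$')
with open(...) as inf:
    for line in inf:
        m = time_regex.match(line)
        if m is None:
            continue  # blank lines do not match
        # Use m.group(1) and m.group(2).split()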
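For a version you can run without the file, here is a sketch that feeds sample lines through io.StringIO. The function parse_timetable, the name LINE_RE, and the reading of each line as "hour, then minute values" are assumptions based on how the question's regex splits the first four digits.

```python
import io
import re

# One line: HHMM, then zero or more " MM" fields, then an optional trailing space.
LINE_RE = re.compile(r'^(\d{2})(\d{2})((?: \d{2})*) ?$')

def parse_timetable(lines):
    """Return a list of (hour, [minutes]) tuples, one per matching line."""
    rows = []
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # blank or unrecognized line
        hour, first_minute, rest = m.groups()
        rows.append((hour, [first_minute] + rest.split()))
    return rows

# Assumed sample taken from the lines shown in the question.
sample = io.StringIO("0430\n0500 20 40 53\n0711 22 33 44 55 \n2300 20 40\n0000\n\n")
rows = parse_timetable(sample)
for hour, minutes in rows:
    print(hour, minutes)
```

Because each line is matched on its own, a whitespace token can never reach into the next line, which is the failure described above.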
