
JavaScript regular expression with multiple parentheses

I am trying to write a JavaScript regex that matches only NASM-style comments in HTML. For example, it should match "; interrupt" in "INT 21h ; interrupt".

As you may know, /;.*/ can't be the answer, because there can be an HTML entity before the comment. I thought /(?:[^&]|&.+;)*(;.*)$/ should work, but I found it has two problems:

  1. "      ; hello world".match(/(?:[^&]|&.+;)*(;.*)$/) is an array ["      ; hello world", "; hello world"]. I don't want an array.
  2. "      ; hello world; a message".match(/(?:[^&]|&.+;)*(;.*)$/) is ["      ; hello world; a message", "; a message"]. The second element is even worse.

Question:

  1. Why is the (?:) block returned?
  2. Why is the result "; a message" and not "; hello world; a message"?
  3. What's the right regex I can use?

1) The (?:) group is not being returned. What you are seeing is that the .match() method (without the g flag) returns an array whenever it finds a match: the first element is the whole match, and the following elements (if any) are the captured groups. In this case you have one capturing group, so the array contains two items. If nothing matches, .match() returns null instead.
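A small sketch of what .match() hands back, using the regex from the question. The variable names are only illustrative.

```javascript
// The regex from the question: one non-capturing group, one capturing group.
const re = /(?:[^&]|&.+;)*(;.*)$/;

const result = "      ; hello world".match(re);

// result[0] is the whole match, result[1] is the only capturing group.
// The (?:...) group contributes no element of its own.
const wholeMatch = result[0]; // "      ; hello world"
const comment = result[1];    // "; hello world"

// When nothing matches, .match() returns null rather than an array.
const noMatch = "INT 21h".match(re);

console.log(result.length, JSON.stringify(comment), noMatch);
```

So to get just the comment as a string, index into the array (after checking for null) rather than looking for a regex that returns no array.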

2) Because of the first half of your regex:

(?:[^&]|&.+;)*

This is not a good idea! It will match just about anything, including newlines (which [^&] accepts). In fact, the only thing it won't get past is a "&" that is not followed by a ";" later on the same line. Because the group is greedy, it consumes as much as it can and only gives back enough for (;.*)$ to succeed, so it matches everything up to the last ";" in each of your lines.
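The behaviour described above can be checked directly. This is a minimal sketch; the ^ anchor on the standalone prefix and the sample strings are additions for illustration only.

```javascript
// The prefix from the question on its own, anchored at the start for illustration.
const prefix = /^(?:[^&]|&.+;)*/;

// [^&] accepts any character except "&", including ";" and "\n",
// so the prefix swallows this whole two-line string.
const input = "mov ax, 1 ; first\n; second";
const eaten = input.match(prefix)[0];

// It only stops at a "&" that has no ";" after it on the same line.
const stopped = "a & b".match(prefix)[0]; // "a "

// With (;.*)$ appended, the greedy prefix backtracks only as far as the LAST ";".
const full = /(?:[^&]|&.+;)*(;.*)$/;
const captured = "      ; hello world; a message".match(full)[1]; // "; a message"

console.log(eaten === input, JSON.stringify(stopped), JSON.stringify(captured));
```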

3) I'm not at all familiar with NASM-style comments in HTML, so I'd need to see a more extensive list of what you want matched and not matched in order to confidently give a good answer here.

But here's something I've thrown together very quickly, to at least solve the two examples you gave above:

.*&.*?;\s(;.*)$
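A quick check of this pattern, under one assumption: the leading whitespace in the question's examples is written as entities such as &nbsp; in the HTML source, followed by a whitespace character. As displayed, the examples contain no "&", and the sample strings below are made up for illustration.

```javascript
const quick = /.*&.*?;\s(;.*)$/;

// Entity, then whitespace, then the comment: the capture starts at the
// first ";" after the entity, so the full comment is returned.
const withEntity = "&nbsp;&nbsp; ; hello world; a message".match(quick);
const comment = withEntity[1]; // "; hello world; a message"

// The pattern requires a "&" somewhere before the comment,
// so a line without any entity does not match at all.
const withoutEntity = "INT 21h ; interrupt".match(quick); // null

console.log(JSON.stringify(comment), withoutEntity);
```

In other words, this pattern covers lines where an entity precedes the comment, but not the plain "INT 21h ; interrupt" case from the question.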

ad 1.) The ?: block is not returned. Instead, the complete match is returned in the first array element. This behavior follows the specification for non-global matching (i.e. without the g option).

ad 2.) The first part of your regex ( (?:[^&]|&.+;)* ) matches too much. In fact, it would match the complete line if you dropped the second portion. In plain English, you asked to match either any symbol other than & , or a & followed by as many characters as possible followed by a ; , and you asked the engine to repeat this as often as possible, which takes it up to the last ; in the test string (if there is one).

ad 3.) Try

(?:[^&;]*(&[a-zA-Z0-9_-]+;[^&;]*)*)(;.*)$

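It fixes the broken entity matching and returns the longest suffix that starts with ; . Note that the inner parentheses form a capturing group, so the comment is the last element of the returned array (index 2), not index 1.

Tested with the pagecolumn regex tester (I'm not affiliated with this website).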
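A hedged sketch of how such a pattern could be wrapped to return a plain string. The helper name extractComment is hypothetical, and the pattern below is a variant of the one in this answer, not the answer's own. It adds two assumed tweaks: a leading ^ , because an unanchored pattern can start matching in the middle of an entity (for "a &lt; b" it would report "; b"), and a non-capturing inner group so that the comment is group 1.

```javascript
// Hypothetical helper; the ^ anchor and the (?:...) inner group are
// assumptions added for this sketch, not part of the regex in the thread.
const COMMENT_RE = /^(?:[^&;]*(?:&[a-zA-Z0-9_-]+;[^&;]*)*)(;.*)$/;

function extractComment(line) {
  const m = line.match(COMMENT_RE);
  // Return the comment text, or null when the line has no comment.
  return m ? m[1] : null;
}

console.log(extractComment("INT 21h ; interrupt"));
console.log(extractComment("&nbsp;&nbsp;; hello world; a message"));
console.log(extractComment("a &lt; b"));
```

The entity class [a-zA-Z0-9_-] does not include "#", so numeric entities such as &#160; are not recognised by this variant. The anchor also means a line with a bare "&" before the comment yields null, which seems acceptable if the HTML escapes it as &amp; .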
