
JavaScript Regex: Finding a String that does not contain </p>

I'm trying to write a regex that will find a string of HTML tags inside a code editor ( Khan Live Editor ) and give the following error:

"You can't put <h1.. 2.. 3..> inside <p> elements."

This is the string I'm trying to match:

<p> ... <h1>

This is the string I don't want to match:

<p> ... </p><h1>

Instead the expected behavior is that another error message appears in this situation.

So in English, I want a string that:
- starts with <p> and
- ends with <h1> but
- does not contain </p> .

It's easy enough to make this work if I don't care about the existence of a </p> . My expression looks like this, /<p>.*<h[1-6]>/ and it works fine. But I need to make sure that </p> does not come between the <p> and <h1> tags (or any <h#> tag, hence the <h[1-6]> ).


I've tried a lot of different expressions from some other posts on here:

Regular expression to match a line that doesn't contain a word?

From which I tried: <p>^((?!<\/p>).)*$</h1>

regex string does not contain substring

From which I tried: /^<p>(?!<\/p>)<h1>$/

Regular expression that doesn't contain certain string

This link suggested: aa([^a] | a[^a])aa

Which doesn't work in my case because I need the specific string " </p> " not just the characters of it since there might be other tags between <p> ... <h1> .


I'm really stumped here. The regex I've tried seems like it should work... Any idea how I would make this work? Maybe I'm implementing the suggestions from other posts wrong?

Thanks in advance for any help.

Edit:

To answer why I need this done:

The problem is that <p><h1></h1></p> is a syntax error since h1 closes the first <p> and there is an unmatched </p> . The original syntax error is not informative, but in most cases it is correct; my example being the exception. I'm trying to pass the syntax parser a new message to override the original message if the regex finds this exception.

Sometimes it's better to break a problem down.

var str = "YOUR INPUT HERE";
str = str.substr(str.indexOf("<p>"));
str = str.substr(0,str.lastIndexOf("<h1>"));
if( str.indexOf("</p>") > -1) {
    // there is a <p>...</p>...<h1>
}
else {
    // there isn't
}

This code doesn't handle the case of "what if there is no <p> to begin with" very well, but it does give a basic idea of how to break a problem down into simpler parts, without using regex.
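One way to cover the missing-<p> case is to check each index before slicing. This is only a sketch: the function name headingInsideOpenParagraph is made up for illustration, and like the snippet above it only looks at the first <p> and the last <h1>.

```javascript
// Hypothetical helper, not part of the original answer.
// Returns true when the text between the first <p> and the last <h1>
// contains no </p>, i.e. the heading follows a still-open paragraph.
function headingInsideOpenParagraph(html) {
  const start = html.indexOf("<p>");
  if (start === -1) return false;              // no <p> at all
  const end = html.lastIndexOf("<h1>");
  if (end === -1 || end < start) return false; // no <h1> after the <p>
  const between = html.slice(start, end);
  return between.indexOf("</p>") === -1;
}

console.log(headingInsideOpenParagraph("<p> ... <h1>"));          // true
console.log(headingInsideOpenParagraph("<p> ... </p><h1>"));      // false
console.log(headingInsideOpenParagraph("<h1>no paragraph</h1>")); // false
```

Input with several paragraphs would need more work, since only the first <p> is considered.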

Search for <p> followed by any number of characters, none of which begins a </p> ( [^] matches any character at all, so newlines are captured too), eventually followed by <h[1-6]> . The lookahead has to come before [^] ; placed after it, the character right after <p> is never checked, so <p></p><h1> would still match.

/<p>(?:(?!<\/p>)[^])*<h[1-6]>/gi



const strings = [
    '<p> ... <h1>',
    '<p> ... </p><h1>',
    '<P> Hello <h1>',
    '<p></p><h1>',
    '<p><h1>'
];
const regex = /<p>(?:(?!<\/p>)[^])*<h[1-6]>/gi;
const test = input => ({
    input,
    test: regex.test(input),
    matches: input.match(regex)
});
for (let input of strings) console.log(JSON.stringify(test(input)));
// { "input": "<p> ... <h1>", "test": true, "matches": ["<p> ... <h1>"] }
// { "input": "<p> ... </p><h1>", "test": false, "matches": null }
// { "input": "<P> Hello <h1>", "test": true, "matches": ["<P> Hello <h1>"] }
// { "input": "<p></p><h1>", "test": false, "matches": null }
// { "input": "<p><h1>", "test": true, "matches": ["<p><h1>"] }

Your first regular expression was close, but the ^ and $ anchors need to be removed. If you need to match across newlines, use [\s\S] instead of . .

Here's the final regex: <p>(?:(?!<\/p>)[\s\S])*<h[1-6]>
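To illustrate the newline point, here is a small sketch comparing the . version with the [\s\S] version on multi-line input. The sample strings and variable names are made up for the example.

```javascript
// Same pattern twice: once with "." and once with "[\s\S]".
const dotPattern = /<p>(?:(?!<\/p>).)*<h[1-6]>/;
const anyCharPattern = /<p>(?:(?!<\/p>)[\s\S])*<h[1-6]>/;

// Illustrative inputs: a heading after an unclosed <p>, and after a closed one.
const open = "<p>\n  some text\n<h2>";
const closed = "<p>\n  some text\n</p>\n<h2>";

console.log(dotPattern.test(open));       // false: "." stops at the newline
console.log(anyCharPattern.test(open));   // true
console.log(anyCharPattern.test(closed)); // false: </p> comes before <h2>
```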

However, a heading tag ( <h1> - <h6> ) appearing before a paragraph's closing tag is perfectly legal HTML. The two are simply treated as sibling elements, with the paragraph element ending where the heading element begins.

A p element's end tag may be omitted if the p element is immediately followed by an address , article , aside , blockquote , dir , div , dl , fieldset , footer , form , h1 , h2 , h3 , h4 , h5 , h6 , header , hr , menu , nav , ol , p , pre , section , table , or ul element, or if there is no more content in the parent element and the parent element is not an a element.

http://www.w3.org/TR/html-markup/p.html

I'm reaching the conclusion that using a regular expression to find the error is going to turn your one problem into two problems.

Consequently, I think a better approach is to do a very simplistic form of tree parsing. A "poor-man's HTML parser", if you will.

Use a simple regular expression to simply find all tags in the HTML, and put them into a list in the same order in which they were found. Ignore the text nodes between the tags.

Then, walk through the list in order, keeping a running tally of the tags. Increment the P counter when you get a <p> tag, and decrement it when you get a </p> tag. Likewise, increment the H counter when you get to an <h1> (etc.) tag, and decrement it on the closing tag.

If the H counter is > 0 while the P counter is > 0, that's your error.
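The counting idea could be sketched roughly as follows. The function name, the tag pattern, and the choice to return the index of the offending tag are all assumptions made for illustration, not something specified in the answer.

```javascript
// Hypothetical sketch of the tag-counting approach; names are illustrative only.
function findHeadingInsideParagraph(html) {
  // Matches opening or closing <p> and <h1>-<h6> tags, with optional attributes.
  // The \b keeps tags such as <pre> from being mistaken for <p>.
  const tagPattern = /<(\/?)(p|h[1-6])\b[^>]*>/gi;
  let openParagraphs = 0;
  let openHeadings = 0;

  for (const match of html.matchAll(tagPattern)) {
    const delta = match[1] === "/" ? -1 : 1; // closing tags decrement
    if (match[2].toLowerCase() === "p") {
      openParagraphs += delta;
    } else {
      openHeadings += delta;
    }
    // Both counters positive: a heading and a paragraph are open together.
    if (openHeadings > 0 && openParagraphs > 0) {
      return match.index; // position of the tag that triggered the rule
    }
  }
  return -1; // rule never triggered
}

console.log(findHeadingInsideParagraph("<p> ... <h1>"));     // 8
console.log(findHeadingInsideParagraph("<p> ... </p><h1>")); // -1
```

Because the rule only compares the two counters, it also fires when a <p> is opened inside a heading; telling the two cases apart would need a real stack rather than counters.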

I know I'm not formatting it correctly, but I think the logic will work

(just replace the AND and NOT with the correct symbols):

/(<p>.*<h[1-6]>)AND !(<p>.*</p><h[1-6]>)/

Let me know how it goes :)
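AND and NOT are not regex operators, but the same logic can be written as two separate tests joined with JavaScript's && and ! . A minimal sketch, with made-up variable and function names:

```javascript
// The two halves of the expression above, as separate patterns.
const pThenHeading = /<p>.*<h[1-6]>/;
const closedPThenHeading = /<p>.*<\/p><h[1-6]>/;

// Hypothetical helper: true when the first pattern matches and the second does not.
function looksLikeHeadingInsideP(line) {
  return pThenHeading.test(line) && !closedPThenHeading.test(line);
}

console.log(looksLikeHeadingInsideP("<p> ... <h1>"));     // true
console.log(looksLikeHeadingInsideP("<p> ... </p><h1>")); // false
```

Note that the second pattern only recognizes a </p> placed directly before the heading, so input such as <p> ... </p> text <h1> would still be reported.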

