I'm trying to write a regex that will find a string of HTML tags inside a code editor (Khan Live Editor) and give the following error:
"You can't put <h1.. 2.. 3..> inside <p> elements."
This is the string I'm trying to match:
<p> ... <h1>
This is the string I don't want to match:
<p> ... </p><h1>
Instead the expected behavior is that another error message appears in this situation.
So, in English, I want a string that:
- starts with <p>, and
- ends with <h1>, but
- does not contain </p>.
It's easy enough to make this work if I don't care about the existence of a </p>. My expression looks like this, /<p>.*<h[1-6]>/, and it works fine. But I need to make sure that </p> does not come between the <p> and <h1> tags (or any <h#> tag, hence the <h[1-6]>).
I've tried a lot of different expressions from some other posts on here:
Regular expression to match a line that doesn't contain a word?
From which I tried: <p>^((?!<\/p>).)*$</h1>
regex string does not contain substring
From which I tried: /^<p>(?!<\/p>)<h1>$/
Regular expression that doesn't contain certain string
This link suggested: aa([^a] | a[^a])aa
Which doesn't work in my case because I need the specific string "</p>", not just the characters of it, since there might be other tags between <p> ... <h1>.
I'm really stumped here. The regex I've tried seems like it should work... Any idea how I would make this work? Maybe I'm implementing the suggestions from other posts wrong?
Thanks in advance for any help.
Edit:
To answer why I need this done:
The problem is that <p><h1></h1></p> is a syntax error, since h1 closes the first <p> and there is an unmatched </p>. The original syntax error is not informative, but in most cases it is correct; my example is the exception. I'm trying to pass the syntax parser a new message to override the original message if the regex finds this exception.
Sometimes it's better to break a problem down.
var str = "YOUR INPUT HERE";
str = str.substr(str.indexOf("<p>"));
str = str.substr(0,str.lastIndexOf("<h1>"));
if (str.indexOf("</p>") > -1) {
    // there is a <p>...</p>...<h1>
} else {
    // there isn't
}
This code doesn't handle the case of "what if there is no <p> to begin with" very well, but it does give a basic idea of how to break a problem down into simpler parts, without using regex.
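The same idea can be pushed a little further so that a missing <p> or a missing heading is handled explicitly. What follows is only a sketch under my own assumptions: the function names are made up for illustration, tags are assumed to carry no attributes (a literal <p>, </p>, <h1> to <h6>), and it still uses nothing but plain string searches.

```javascript
// Hypothetical helper: index of the first <h1>..<h6> opening tag at or after `from`,
// or -1 when there is none.
function firstHeadingIndex(text, from) {
  let best = -1;
  for (let level = 1; level <= 6; level++) {
    const i = text.indexOf("<h" + level + ">", from);
    if (i !== -1 && (best === -1 || i < best)) best = i;
  }
  return best;
}

// Hypothetical helper: true when some <p> is followed by a heading
// before any </p> closes it.
function headingInsideOpenParagraph(html) {
  const text = html.toLowerCase(); // treat <P> and <p> alike
  let open = text.indexOf("<p>");
  while (open !== -1) {
    const close = text.indexOf("</p>", open);
    const heading = firstHeadingIndex(text, open);
    if (heading !== -1 && (close === -1 || heading < close)) return true;
    open = text.indexOf("<p>", open + 3); // try the next <p>, if any
  }
  return false;
}
```

With this, headingInsideOpenParagraph("<h1>") simply returns false instead of slicing from a bogus index, because the loop never runs when there is no <p>.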
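Search for <p> followed by any number of characters ([^] is a JavaScript idiom meaning any character that is not nothing; this allows us to also capture newlines), none of which starts a </p>, eventually followed by <h[1-6]>. The lookahead has to come before [^] so that each character is checked before it is consumed; if it comes after, a </p> directly following the <p> is never checked and slips through.
/<p>(?:(?!<\/p>)[^])*<h[1-6]>/gi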
const strings = [
  '<p> ... <h1>',
  '<p> ... </p><h1>',
  '<P> Hello <h1>',
  '<p></p><h1>',
  '<p><h1>'
];
const regex = /<p>(?:(?!<\/p>)[^])*<h[1-6]>/gi;
const test = input => ({
  input,
  test: regex.test(input),
  matches: input.match(regex)
});
for (let input of strings) console.log(JSON.stringify(test(input)));
// { "input": "<p> ... <h1>", "test": true, "matches": ["<p> ... <h1>"] }
// { "input": "<p> ... </p><h1>", "test": false, "matches": null }
// { "input": "<P> Hello <h1>", "test": true, "matches": ["<P> Hello <h1>"] }
// { "input": "<p></p><h1>", "test": false, "matches": null }
// { "input": "<p><h1>", "test": true, "matches": ["<p><h1>"] }
Your first regular expression was close, but you need to remove the ^ and $ characters. If you need to match across newlines, you should use [\s\S] instead of . (the dot).
Here's the final regex: <p>(?:(?!<\/p>)[\s\S])*<h[1-6]>
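To see why the dot is not enough, here is a small sketch comparing the two character classes on multi-line input. The variable names and sample strings are just mine for illustration, and I added the i flag on the assumption that uppercase tags should match too.

```javascript
// Same pattern twice: once with the dot, once with [\s\S].
const dotVersion = /<p>(?:(?!<\/p>).)*<h[1-6]>/i;
const anyCharVersion = /<p>(?:(?!<\/p>)[\s\S])*<h[1-6]>/i;

// Sample inputs that spread the tags over several lines.
const unclosed = "<p>\n  some text\n<h2>Title</h2>";
const closed = "<p>\n  some text\n</p>\n<h2>Title</h2>";

console.log(dotVersion.test(unclosed));     // false: the dot stops at the first newline
console.log(anyCharVersion.test(unclosed)); // true
console.log(anyCharVersion.test(closed));   // false: a </p> sits between <p> and <h2>
```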
However, writing a heading tag (<h1> - <h6>) after an unclosed <p> is perfectly legal HTML. The heading does not end up inside the paragraph; the two are sibling elements, with the paragraph element ending where the heading element begins. As the HTML specification puts it:
A p element's end tag may be omitted if the p element is immediately followed by an address , article , aside , blockquote , dir , div , dl , fieldset , footer , form , h1 , h2 , h3 , h4 , h5 , h6 , header , hr , menu , nav , ol , p , pre , section , table , or ul element, or if there is no more content in the parent element and the parent element is not an a element.
I'm reaching the conclusion that using a regular expression to find the error is going to turn your one problem into two problems.
Consequently, I think a better approach is to do a very simplistic form of tree parsing. A "poor-man's HTML parser", if you will.
Use a simple regular expression to find all tags in the HTML, and put them into a list in the order in which they were found. Ignore the text nodes between the tags.
Then, walk through the list in order, keeping a running tally of the tags. Increment the P counter when you get a <p> tag, and decrement it when you get a </p> tag. Increment the H counter when you get to an <h1> (etc.) tag, and decrement it on the closing tag.
If the H counter is > 0 while the P counter is > 0, that's your error.
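Here is roughly what that could look like in JavaScript. Treat it as a minimal sketch, not a real parser: the function name and the tag pattern are my own assumptions, it does not understand comments, scripts, or a > inside an attribute value, and clamping the counters at zero for stray closing tags is an extra choice of mine.

```javascript
// Hypothetical "poor man's parser": returns the tag at which a paragraph and a
// heading are first open at the same time, or null if that never happens.
function findHeadingInsideParagraph(html) {
  // Step 1: collect every tag in document order, ignoring the text between them.
  const tagPattern = /<(\/?)([a-z][a-z0-9]*)\b[^>]*>/gi;
  const tags = Array.from(html.matchAll(tagPattern), m => ({
    name: m[2].toLowerCase(),
    closing: m[1] === "/",
    index: m.index
  }));

  // Step 2: walk the list, keeping a running tally of open <p> and <h1>..<h6> tags.
  let pCount = 0;
  let hCount = 0;
  for (const tag of tags) {
    const delta = tag.closing ? -1 : 1;
    if (tag.name === "p") {
      pCount = Math.max(0, pCount + delta); // never go below zero on a stray </p>
    } else if (/^h[1-6]$/.test(tag.name)) {
      hCount = Math.max(0, hCount + delta);
    }
    // Step 3: both counters positive at once is the error condition.
    if (pCount > 0 && hCount > 0) return tag;
  }
  return null;
}
```

For "<p><h1></h1></p>" this returns the <h1> tag at index 3, which gives you a position to attach the custom error message to. For "<p>text</p><h1>Title</h1>" it returns null.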
I know I'm not formatting it correctly, but I think the logic will work (just replace the AND and NOT with the correct symbols):
/(<p>.*<h[1-6]>)AND !(<p>.*</p><h[1-6]>)/
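In JavaScript, the AND and NOT would be && and ! applied to two separate tests, since a single regex literal cannot contain them. A rough sketch, with names I made up for illustration:

```javascript
// The two halves of the pseudo-expression above, as separate regexes.
const paragraphThenHeading = /<p>.*<h[1-6]>/;
const closedParagraphThenHeading = /<p>.*<\/p><h[1-6]>/;

// Hypothetical helper: first pattern matches AND the second does NOT.
function looksLikeHeadingInParagraph(line) {
  return paragraphThenHeading.test(line) && !closedParagraphThenHeading.test(line);
}

console.log(looksLikeHeadingInParagraph("<p> ... <h1>"));     // true
console.log(looksLikeHeadingInParagraph("<p> ... </p><h1>")); // false
```

One caveat: the second pattern only rules out a </p> that sits directly before the heading, so something like "<p>a</p> text <h1>" would still be flagged.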
Let me know how it goes :)