PHP Regular Expression: How to find every < that doesn't belong to a valid HTML Element

Is it possible to find all < (less-than angle bracket) and > characters in PHP that do not belong to valid HTML elements (the result could be stored in an array)? I'd like to escape these characters automatically.

Example:

$html = '<div class="some class"><pre>5 < 8</pre></div>';

$triangles = getAllTriangles($html);

where getAllTriangles($html) finds only one angle bracket (the one between 5 and 8), so it could be escaped as &lt; while the others stay as they are to produce the right output.

EDIT: Actually, my problem comes from PHP's DOMDocument and its parser. If I read an HTML string like the one above:

$html = '<div class="some class"><pre>5 < 8</pre></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$output = $doc->saveHTML();

This will result in

<div class="some class"><pre>5 </pre></div>

because of the angle bracket. That is why I'd like to escape these characters automatically. It would be a real problem to escape them by hand in the HTML strings I'm reading. Once all stray angle brackets are escaped, I can use DOMDocument as intended.

What I really want is a regular expression that replaces all angle brackets that don't belong to HTML tags. The output for the example above would be:

<div class="some class"><pre>5 &lt; 8</pre></div>

More examples:

input:    <pre>while i < 10 do....</pre>
output:   <pre>while i &lt; 10 do....</pre>

input:    <div><button-1></div>
output:   <div>&lt;button-1&gt;</div>

You could try stripping all HTML tags from your string and then using simple string functions to find the < and > characters in the result:

$html = '<div class="some class"><pre>5 < 8</pre></div>';
$no_html = strip_tags($html);
var_dump($no_html);
$count = substr_count($no_html, '<');
var_dump($count);


However, please note that this approach may fail because your "html" string is not valid HTML: any < and > that are not part of HTML tags should be encoded as &lt; and &gt;.
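For cases where you do control the point at which the text is wrapped in markup, the encoding can be done there. This is only a minimal sketch: the variable names are illustrative, and it assumes the text content is available separately before the tags are added, which may not hold if you only receive finished HTML strings.

```php
<?php
// Assumption: the raw text is known before it is embedded in markup.
$text = '5 < 8';

// htmlspecialchars() encodes <, >, & and quotes as HTML entities.
$encoded = htmlspecialchars($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

$html = '<div class="some class"><pre>' . $encoded . '</pre></div>';

echo $html, PHP_EOL; // <div class="some class"><pre>5 &lt; 8</pre></div>
```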

If you need something other than the count, I would recommend using an HTML parser instead of regular expressions, and possibly applying regular expressions to the contents you find with the parser. The same caveat about invalid HTML applies here as well.
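Because the parser itself chokes on the unescaped brackets, one possible workaround is a regex pre-pass that escapes every < and > that is not part of a known tag, before the string is handed to DOMDocument. The following is only a sketch under stated assumptions, not a general solution: escapeStrayAngleBrackets() and the $allowed whitelist are hypothetical names introduced here, the whitelist has to list every tag your input actually uses, and attribute values that themselves contain < or > are not handled.

```php
<?php
// Hypothetical helper: escape "<" and ">" that are not part of a whitelisted tag.
function escapeStrayAngleBrackets(string $html, array $allowedTags): string
{
    $names = implode('|', array_map(
        static fn (string $tag): string => preg_quote($tag, '~'),
        $allowedTags
    ));

    // Tag-like token: "<" or "</", an allowed name, optional attribute text
    // (no angle brackets inside), optional "/", then ">".
    $pattern = '~(</?(?:' . $names . ')(?:\s[^<>]*)?/?>)~i';

    // With PREG_SPLIT_DELIM_CAPTURE the result alternates:
    // even index = text between tags, odd index = a recognised tag.
    $parts = preg_split($pattern, $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Only the brackets are escaped, so existing entities stay untouched.
            $parts[$i] = str_replace(['<', '>'], ['&lt;', '&gt;'], $part);
        }
    }

    return implode('', $parts);
}

// Assumed whitelist for the examples in the question.
$allowed = ['div', 'pre', 'button'];

echo escapeStrayAngleBrackets('<div class="some class"><pre>5 < 8</pre></div>', $allowed), PHP_EOL;
// <div class="some class"><pre>5 &lt; 8</pre></div>

echo escapeStrayAngleBrackets('<div><button-1></div>', $allowed), PHP_EOL;
// <div>&lt;button-1&gt;</div>
```

The escaped string can then be passed to $doc->loadHTML() as in the question. Splitting on recognised tags rather than matching the stray brackets directly keeps the pattern simple: anything that is not a whitelisted tag is treated as text.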
