
How can I remove all HTML tags from a PHP string?

I have a PHP string like this:

$string = "<b class='classname'>this</b> is a `<a href='#'>link</a>`

               <p>and this is a test</p>

           Also this is <i>another</i> test.";

I want this output:

$string = "this is a `<a href='#'>link</a>`

               <p>and this is a test</p>

           Also this is another test.";

As you can see, I want to remove all HTML tags except those that:

  • are surrounded by backticks (`), or
  • sit on a line that starts with four spaces and has a blank line above and below it.

Note: I can use strip_tags() to remove all HTML tags, but it also removes the tags that should be kept. htmlspecialchars() doesn't do what I need either.

Well, it's ugly, but it works on this example:

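<?php
// Callback for preg_replace_callback(): $m[0] is the whole match, $m[1] is
// whatever text was captured in front of a backtick span, if any.
function translate($m) {
    if (isset($m[1]) && $m[1] != "") {
        // Cut the leading text out of the match, strip its tags, and put it
        // back in front of the untouched backtick span.
        $m[0] = str_replace($m[1], "", $m[0]);
        return strip_tags($m[1]) . $m[0];
    } else {
        // No leading text captured: strip tags from the whole match.
        return strip_tags($m[0]);
    }
}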

$re = "/(.*)`.*`|\n((?<![[:space:]]{4})(.*)\n)/m";
$string = "this is a `<a href='#'>link</a>`

               <p>and this is a test</p>

           Also this is another test.";
$string = $string.$string.$string.$string;
echo preg_replace_callback($re, "translate", $string);
?>

Output:

this is a `<a href='#'>link</a>`

               <p>and this is a test</p>

           Also this is another test.this is a `<a href='#'>link</a>`

               <p>and this is a test</p>

           Also this is another test.this is a `<a href='#'>link</a>`

               <p>and this is a test</p>

           Also this is another test.this is a `<a href='#'>link</a>`

               <p>and this is a test</p>

           Also this is another test.

As I see it, doing this properly means using or writing a parser. If you can absolutely trust the input (in about 99.9% of cases you can't), there are two approaches that might appeal to you.

placeholder replacement

You can match the protected spans with a pattern along the lines of `(.*)`|\n {4}.*\n, replace each match with a unique placeholder, run strip_tags() on the rest of the string, and then put the original code back in.
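A minimal sketch of the placeholder idea, assuming trusted input. The function name strip_tags_keep_code, the exact pattern, and the \x1A-delimited tokens are illustrative choices, not part of the original answer. The tokens avoid NUL bytes because strip_tags() removes those, and the sketch assumes the input never contains the \x1A control character itself.

```php
<?php
// Hypothetical helper: strip HTML tags everywhere except inside backtick
// spans and four-space-indented lines that sit between two blank lines.
function strip_tags_keep_code(string $text): string
{
    $protected = [];

    // Either a backtick span on a single line, or a line indented by four or
    // more spaces with an empty line directly above and below it.
    $pattern = '/`[^`\n]*`|(?<=\n\n)[ ]{4,}[^\n]*(?=\n\n)/';

    // Step 1: swap every protected span for a unique token.
    $masked = preg_replace_callback(
        $pattern,
        function (array $m) use (&$protected): string {
            $token = "\x1A" . count($protected) . "\x1A";
            $protected[$token] = $m[0];
            return $token;
        },
        $text
    );

    // Step 2: strip tags from what is left.
    $stripped = strip_tags($masked);

    // Step 3: put the original spans back.
    return $protected ? strtr($stripped, $protected) : $stripped;
}
```

Calling strip_tags_keep_code($string) on the string from the question should leave the `<a>` and `<p>` tags in place while removing `<b>` and `<i>`. A tag that opens before a protected span and closes after it would still swallow the token, which is one reason this only suits trusted input.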

splitting

preg_split can return the delimiters along with the pieces, using the PREG_SPLIT_DELIM_CAPTURE flag (also see the comments on the linked page). So you can declare `(.*)`|\n {4}.*\n as the delimiter, wrapped in a capture group, and output the resulting array after applying strip_tags() to every other entry.
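The splitting idea could look roughly like this. It is a sketch that assumes trusted input; the function name strip_tags_by_splitting and the pattern are illustrative, not taken from the original answer.

```php
<?php
// Hypothetical helper: split the text on protected spans, keep the spans,
// and strip tags only from the pieces in between.
function strip_tags_by_splitting(string $text): string
{
    // The whole delimiter sits in one capture group, so with
    // PREG_SPLIT_DELIM_CAPTURE the result alternates:
    // text, code, text, code, ..., text.
    $pattern = '/(`[^`\n]*`|(?<=\n\n)[ ]{4,}[^\n]*(?=\n\n))/';
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Even index: ordinary text between two protected spans.
            $parts[$i] = strip_tags($part);
        }
    }

    return implode('', $parts);
}
```

Empty pieces are deliberately kept (no PREG_SPLIT_NO_EMPTY), because dropping them would break the even/odd alternation. Each piece is stripped on its own, so an unclosed `<` can only damage the piece it appears in.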

final remarks

You should remember that there's always some edge case. As I see it, you want to copy Stack Overflow's syntax. SO's syntax also has <code>...</code> to mark inline code, as well as <pre></pre> for preformatted blocks. There is a problem when you want a literal ` inside backticks, which is done by using double backticks as the delimiter (and would break both approaches above). Also, SO's syntax requires not one but two line breaks before an indented code block.
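One way the double-backtick case might be handled is to capture the opening run of backticks and require a closing run of the same length. This is only a sketch: the function name find_code_spans is hypothetical, and the pattern keeps spans on a single line, which is an assumption rather than a rule taken from SO's syntax.

```php
<?php
// Hypothetical helper: list the code spans in $text, where a span opened by
// a run of N backticks is closed only by another run of exactly N backticks.
function find_code_spans(string $text): array
{
    // (?<!`)(`+)(?!`)   an opening run of backticks, captured whole
    // (.+?)             the span content, as short as possible, no newlines
    // (?<!`)\1(?!`)     a closing run of the same length
    $pattern = '/(?<!`)(`+)(?!`)(.+?)(?<!`)\1(?!`)/';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}
```

The same expression could stand in for the backtick part of a pattern such as `(.*)`, at the cost of shifting the capture group numbers.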

Some kinds of input are ambiguous. For example, what about:

    `something to dream about

but I want to show

    that everything's simple`

That could either be two code lines with a sentence in between, or a single piece of code containing line breaks. So far your problem description is not clear on that point. It might also imply that you haven't understood the problem well enough.

In the end, the safest solution is a parser, which gives you fine-grained control over what actually happens, because regular expressions almost always seem to have an edge case where they fail or behave unexpectedly.
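To give an idea of what that control looks like, here is a small line-by-line scanner. It is far from a full parser, and the function name strip_tags_line_scanner and its rules are assumptions made for illustration. It decides the ambiguous case explicitly: a backtick span has to open and close on the same line, and an unclosed backtick is treated as ordinary text.

```php
<?php
// Hypothetical helper: walk the text line by line and decide explicitly
// which parts count as code and which get their tags stripped.
function strip_tags_line_scanner(string $text): string
{
    $lines = explode("\n", $text);
    $lastLine = count($lines) - 1;
    $out = [];

    foreach ($lines as $i => $line) {
        $blankAbove = $i > 0 && $lines[$i - 1] === '';
        $blankBelow = $i < $lastLine && $lines[$i + 1] === '';

        // Rule 1: an indented line between two blank lines is a code line.
        if (strncmp($line, '    ', 4) === 0 && $blankAbove && $blankBelow) {
            $out[] = $line;
            continue;
        }

        // Rule 2: on any other line, odd-numbered segments between backticks
        // are code, unless the segment is the last one (unclosed backtick).
        $segments = explode('`', $line);
        $lastSegment = count($segments) - 1;
        foreach ($segments as $j => $segment) {
            $insideSpan = $j % 2 === 1 && $j < $lastSegment;
            if (!$insideSpan) {
                $segments[$j] = strip_tags($segment);
            }
        }
        $out[] = implode('`', $segments);
    }

    return implode("\n", $out);
}
```

Each rule sits in one visible place, so changing the decision, for example to allow code spans across lines, means editing that rule rather than reworking a regular expression.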
