Using regex to remove HTML tags

Question

I need to convert

$text = 'We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe';

to

$text = 'We had fun. Look at this photo (http://example.com) of Joe';

[Edit] There could be multiple links in the text.

All HTML tags are to be removed and the href value from <a> tags needs to be added like above.

What would be an efficient way to solve this with regex? Any code snippet would be great.

Answer 1

First do a preg_replace to keep the link. You could use:

$str = preg_replace('~<a href="(.*?)">(.*?)</a>~', '$2 ($1)', $str);

Then use strip_tags which will finish off the rest of the tags.
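The two steps above can be put together as follows. This is only a minimal sketch: the function name linksToText is made up for illustration, and the pattern assumes a double-quoted href and no nested links. It is slightly more tolerant than the one-liner, since it allows other attributes on the <a> tag and link text that spans several lines.

```php
<?php
// Hypothetical helper name; assumes double-quoted href attributes.
function linksToText(string $html): string
{
    // Rewrite each <a ... href="url" ...>label</a> as "label (url)".
    // ~ is the delimiter, i = case-insensitive, s = let . match newlines.
    $withUrls = preg_replace(
        '~<a\s[^>]*?href="([^"]*)"[^>]*>(.*?)</a>~is',
        '$2 ($1)',
        $html
    );

    // Remove every remaining tag, such as <i> or <b>.
    return strip_tags($withUrls);
}

echo linksToText('We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe'), "\n";
// prints: We had fun. Look at this photo (http://example.com) of Joe
```

Because preg_replace replaces every match, this also covers text with multiple links.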

Answer 2

Try an XML parser: replace each tag with its inner HTML, and each <a> tag with its text followed by its href attribute.

http://www.php.net/manual/en/book.domxml.php

Answer 3

The DOM solution:

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//a[@href]') as $node) {
    $textNode = new DOMText(sprintf('%s (%s)',
        $node->nodeValue, $node->getAttribute('href')));
    $node->parentNode->replaceChild($textNode, $node);
}
echo strip_tags($dom->saveHTML());

and the same without XPath:

$dom = new DOMDocument;
$dom->loadHTML($html);
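// Copy the live node list first; replacing nodes while iterating it can skip links.
foreach(iterator_to_array($dom->getElementsByTagName('a')) as $node) {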
    if($node->hasAttribute('href')) {
        $textNode = new DOMText(sprintf('%s (%s)',
            $node->nodeValue, $node->getAttribute('href')));
        $node->parentNode->replaceChild($textNode, $node);
    }
}
echo strip_tags($dom->saveHTML());

All it does is load the HTML into a DOMDocument instance. In the first case it uses an XPath expression, which is kinda like SQL for XML, to get all links with an href attribute. It then creates a text node from each link's text content (nodeValue) and its href attribute and replaces the link with it. The second version uses just the DOM API and no XPath.

Yes, it's a few lines more than Regex but this is clean and easy to understand and it won't give you any headaches when you need to add additional logic.
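To illustrate the point about adding logic, here is a rough sketch of the XPath version wrapped in a function with one extra rule. The function name htmlToTextWithLinks and the rule itself (in-page anchors such as "#top" keep only their text) are assumptions made up for this example, not part of the question. It also assumes the dom extension is available and the input is non-empty.

```php
<?php
// Hypothetical helper; assumes ext-dom is enabled and $html is not empty.
function htmlToTextWithLinks(string $html): string
{
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//a[@href]') as $node) {
        $href = $node->getAttribute('href');
        // Extra rule (illustrative assumption): in-page anchors keep only their text.
        $text = ($href !== '' && $href[0] === '#')
            ? $node->textContent
            : sprintf('%s (%s)', $node->textContent, $href);
        $node->parentNode->replaceChild($dom->createTextNode($text), $node);
    }

    // textContent joins all remaining text nodes, so no strip_tags pass is needed.
    return trim($dom->documentElement->textContent);
}

echo htmlToTextWithLinks('We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe'), "\n";
```

The new rule is a single conditional inside the loop; doing the same in the regex would mean a more complicated pattern or a callback.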

Answer 4

I've done things like this using variations of substring and replace. ~~I'd probably use regex today~~ but you wanted an alternative so:

For the <i> tags, I'd do something like:

$text = str_replace("<i>", "", $text);
$text = str_replace("</i>", "", $text);

(My PHP is really rusty; the function is str_replace, and the string being searched goes last. The idea is what I'm sharing.)

The <a> tag is a bit more tricky, but it can be done. You need to find where the opening <a tag starts and where its > ends, pull out the href value and the link text, and then replace everything up to and including the closing </a>.

That might go something like:

$start     = strpos($text, '<a ');                   // where the opening <a tag starts
$hrefStart = strpos($text, 'href="', $start) + 6;    // first character of the URL
$href      = substr($text, $hrefStart, strpos($text, '"', $hrefStart) - $hrefStart);
$tagEnd    = strpos($text, '>', $hrefStart);         // the > that ends the opening tag
$end       = strpos($text, '</a>', $tagEnd);         // where the closing </a> starts
$label     = substr($text, $tagEnd + 1, $end - $tagEnd - 1);
$text      = substr($text, 0, $start) . "$label ($href)" . substr($text, $end + 4);

(Again, the idea is what I want to communicate. This handles only the first link and assumes a double-quoted href, so there are still plenty of possible bugs depending on your exact input and environment.)

Reference:

strpos - http://www.php.net/manual/en/function.strpos.php
str_replace - http://www.php.net/manual/en/function.str-replace.php
substr - http://php.net/manual/en/function.substr.php
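Since the question mentions that there can be multiple links, the substring idea in this answer can be run in a loop. The following is a sketch only: the function name linksToTextWithSubstr is hypothetical, and it assumes well-formed markup with double-quoted href attributes and no nested links.

```php
<?php
// Hypothetical helper: rewrites every <a href="...">label</a> as "label (url)"
// with string functions only, then removes the remaining tags with strip_tags.
function linksToTextWithSubstr(string $text): string
{
    $offset = 0;
    while (($start = stripos($text, '<a ', $offset)) !== false) {
        $hrefStart = strpos($text, 'href="', $start);
        $tagEnd    = strpos($text, '>', $start);
        $end       = stripos($text, '</a>', $start);

        // Assumption: a double-quoted href inside the opening tag and a closing </a>.
        if ($hrefStart === false || $tagEnd === false || $end === false
            || $hrefStart > $tagEnd || $tagEnd > $end) {
            $offset = $start + 3; // skip this tag and keep scanning
            continue;
        }

        $hrefStart += 6; // length of 'href="'
        $href  = substr($text, $hrefStart, strpos($text, '"', $hrefStart) - $hrefStart);
        $label = substr($text, $tagEnd + 1, $end - $tagEnd - 1);

        // Splice "label (url)" in place of the whole <a ...>label</a> span.
        $replacement = "$label ($href)";
        $text   = substr($text, 0, $start) . $replacement . substr($text, $end + 4);
        $offset = $start + strlen($replacement);
    }

    return strip_tags($text);
}

echo linksToTextWithSubstr('We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe'), "\n";
```

The amount of bookkeeping needed here is a fair argument for the regex or parser answers.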

Answer 5

It's also very easy to do with a parser:

# available from http://simplehtmldom.sourceforge.net
include('simple_html_dom.php');

# parse and echo
$html = str_get_html('We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe');

# replace every link with "text (href)", not just the first one
foreach ($html->find('a') as $a) {
    $a->outertext = "{$a->innertext} ({$a->href})";
}

echo strip_tags($html);

And that produces the output you want for your test case.

5 answers

solution1
5 ACCEPTED 2010-05-05 18:00:31

solution2
1 2010-05-05 17:58:24

solution3
1 2010-05-05 18:53:49

solution4
0 2010-05-05 18:15:21

solution5
0 2010-05-05 19:29:30