How can I replace strings NOT within a link tag?

Question

I am working on this PHP function. The idea is to wrap certain words occurring in a string in certain tags (both the words and the tags are given in an array). It works, but when those words occur inside linked text or in the link's 'href' attribute, the link is broken and stuffed with tags, or tags that should not be inside a link are generated. This is what I have now:

function replace($body) {
  $terminos = array (
  "beneficios" => "h3",
  "valoracion" => "h2",
  "empresarios" => "h2",
  "tecnologias" => "h2",
  "...and so on..." => "etc",
  );

  // Wrap each term in its tag, carrying the result over to the next term.
  foreach ($terminos as $key => $value)
  {
  $tagged = "<".$value.">".$key."</".$value.">";
  $body = str_replace($key, $tagged, $body);
  }
  return $body;
}

$body = "string where the word empresarios should be replaced; but the word <a href='http://www.empresarios.com'>empresarios</a> should not be replaced inside <a> tags nor in the URL of their 'src' attribute.";
$result = replace($body);

The function, in this example, should return "string where the word <h2>empresarios</h2> should be replaced; but the word <a href='http://www.empresarios.com'>empresarios</a> should not be replaced inside <a> tags nor in the URL of their 'src' attribute."

I'd like this replacement function to work throughout the string, but not inside tags or their attributes!

(I'd like to do what is described in the following thread, except that I need it in PHP rather than JavaScript: /questions/1666790/how-to-replace-text-not-within-a-specific-tag-in-javascript )

Answer 1

Use the DOM and only modify text nodes:

$s = "foo <a href='http://test.com'>foo</a> lorem bar ipsum foo. <a>bar</a> not a test";
echo htmlentities($s) . '<hr>';

$d = new DOMDocument;
$d->loadHTML($s);

$x = new DOMXPath($d);
$t = $x->evaluate("//text()");

$wrap = array(
    'foo' => 'h1',
    'bar' => 'h2'
);

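// Escape each word so regex metacharacters in a key cannot break the pattern.
$preg_find = '/\b(' . implode('|', array_map(fn($w) => preg_quote($w, '/'), array_keys($wrap))) . ')\b/';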

foreach($t as $textNode) {
    if( $textNode->parentNode->tagName == "a" ) {
        continue;
    }

    // -1 means "no limit"; passing null here is deprecated as of PHP 8.1.
    $sections = preg_split( $preg_find, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);

    $parentNode = $textNode->parentNode;

    foreach($sections as $section) {  
        if( !isset($wrap[$section]) ) {
            $parentNode->insertBefore( $d->createTextNode($section), $textNode );
            continue;
        }

        $tagName = $wrap[$section];
        $parentNode->insertBefore( $d->createElement( $tagName, $section ), $textNode );
    }

    $parentNode->removeChild( $textNode );
}

echo htmlentities($d->saveHTML());

Edited to replace DOMText with DOMText and DOMElement as necessary.
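
One limitation worth noting: the parentNode check above only skips text whose direct parent is the <a> element, so text in a child element of a link (for example a <b> inside the <a>) would still be wrapped. A minimal sketch of one way around that, assuming the XPath ancestor axis is acceptable for your markup; the sample HTML and variable names here are made up for illustration:

```php
<?php
// Hypothetical input: one "foo" sits in a <b> nested inside the link.
$html = "<p>foo <a href='http://test.com'><b>foo</b> bar</a> baz</p>";

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// Select only text nodes that have no <a> element anywhere among their ancestors.
$outsideLinks = [];
foreach ($xpath->query('//text()[not(ancestor::a)]') as $textNode) {
    $outsideLinks[] = $textNode->nodeValue;
}

print_r($outsideLinks);
```

With this input the loop should collect only "foo " and " baz"; the nested "foo" and " bar" are filtered out by the query itself, so no parentNode check is needed inside the loop.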

Answer 2

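The JavaScript answer you pointed to works basically the same way in PHP. You just have to write the pattern as a string and drop the g flag, which PHP does not support (the preg functions replace every match by default).

$regexp = "/(<pre>(?:[^<](?!\/pre))*<\/pre>)|(\:\-\))/i";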
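
Adapted to the question, the same "match the protected part or the target" idea could look roughly like this. It is only a sketch: wrapOutsideTags is a hypothetical helper name, the pattern assumes lowercase tag names and links that are not nested, and a regex like this will not cope with every HTML edge case.

```php
<?php
// Hypothetical helper, not part of the original answer.
function wrapOutsideTags(string $html, string $word, string $tag): string
{
    // Alternative 1: a whole <a>...</a> element.
    // Alternative 2: any other tag, attributes included.
    // Alternative 3 (captured): the bare word. Case-sensitive on purpose.
    $pattern = '~<a\b[^>]*>.*?</a>|<[^>]+>|\b(' . preg_quote($word, '~') . ')\b~s';

    return preg_replace_callback($pattern, function (array $m) use ($tag) {
        // Group 1 is only present when the bare word matched.
        if (isset($m[1]) && $m[1] !== '') {
            return "<{$tag}>{$m[1]}</{$tag}>";
        }
        return $m[0]; // protected markup: hand it back unchanged
    }, $html);
}

$body = "the word empresarios, but not <a href='http://www.empresarios.com'>empresarios</a>";
echo wrapOutsideTags($body, 'empresarios', 'h2');
// prints: the word <h2>empresarios</h2>, but not <a href='http://www.empresarios.com'>empresarios</a>
```

Because links and tags are consumed by the first two alternatives, the word inside them is never seen by the third one.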

Also note that you may need another preg_replace call to replace the word 'empresarios' when it is capitalized (Empresarios) or written in mixed case (EmPreSAriOS).
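
For the capitalization point, a small sketch using the i modifier and a backreference, so the original casing of each match is kept. The sample string is made up, and this snippet deliberately ignores the link problem to keep the example short:

```php
<?php
$body = 'Empresarios, EMPRESARIOS and EmPreSAriOS';

// The i modifier makes the match case-insensitive; $1 re-inserts the text
// exactly as it appeared in the input.
$result = preg_replace('/\b(empresarios)\b/i', '<h2>$1</h2>', $body);

echo $result;
// prints: <h2>Empresarios</h2>, <h2>EMPRESARIOS</h2> and <h2>EmPreSAriOS</h2>
```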

Also take care with your HTML. <h2> is a block-level element, so text that originally renders like this:

string where the word empresarios should be replaced;

may be rendered like this after the replacement:

string where the word

empresarios

should be replaced;

Maybe what you need is an inline element such as <big>.

Answer 3

Definitely use a DOM parser to isolate the qualifying text nodes before attempting to replace with a regex pattern that respects word boundaries, case-insensitivity, and Unicode characters. If you are planning to specifically target words with Unicode characters, then you will need to add mb_ to some of the string functions.
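
To illustrate the mb_ remark: when a lookup key contains a non-ASCII letter, the lowercasing step has to be Unicode-aware or the key will not be found. A minimal sketch, assuming the mbstring extension is loaded and the source file is saved as UTF-8; the term 'valoración' is a made-up example, not one from the question:

```php
<?php
// Hypothetical lookup entry containing a non-ASCII letter.
$lookup = ['valoración' => 'h2'];

$fragment = 'VALORACIÓN';

// mb_strtolower() lowercases the accented letter as well, so the result
// matches the lookup key. A byte-oriented strtolower() may leave "Ó" as-is.
$key = mb_strtolower($fragment, 'UTF-8');

var_dump(isset($lookup[$key])); // bool(true)
```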

After leveraging the following insights, I tailored a solution for your scenario.

Code:

$html = <<<HTML
foo <a href='http://test.com'>fóo</a> lórem
bár ipsum bar food foo bark. <a>bar</a> not á test
HTML;

$lookup = [
    'foo' => 'h3',
    'bar' => 'h2'
];

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

$regexNeedles = [];
foreach ($lookup as $word => $tagName) {
    $regexNeedles[] = preg_quote($word, '~');
}
$pattern = '~\b(' . implode('|', $regexNeedles) . ')\b~iu' ;

foreach($xpath->query('//*[not(self::a)]/text()') as $textNode) {
    $newNodes = [];
    $hasReplacement = false;
    foreach (preg_split($pattern, $textNode->nodeValue, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE) as $fragment) {
        $fragmentLower = strtolower($fragment);
        if (isset($lookup[$fragmentLower])) {
            $hasReplacement = true;
            $a = $dom->createElement($lookup[$fragmentLower]);
            $a->nodeValue = $fragment;
            $newNodes[] = $a;
        } else {
            $newNodes[] = $dom->createTextNode($fragment);
        }
    }
    if ($hasReplacement) {
        $newFragment = $dom->createDocumentFragment();
        foreach ($newNodes as $newNode) {
            $newFragment->appendChild($newNode);
        }
        $textNode->parentNode->replaceChild($newFragment, $textNode);
    }
}
echo substr(trim(utf8_decode($dom->saveHTML($dom->documentElement))), 3, -4);

Output:

<h3>foo</h3> <a href="http://test.com">fóo</a> lórem
bár ipsum <h2>bar</h2> food <h3>foo</h3> bark. <a>bar</a> not á test
