
PHP Regex batch update

In short, here is my problem:

$text = 'Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.';
$text = preg_replace('#(?<!((alt|src)="))Lorem(?!(.*("|<\/a>)))#i', '<a href="Lorem" title="Lorem" style="color: inherit;">\0</a>', $text);
$text = preg_replace('#(?<!((alt|src)="))Ipsum(?!(.*("|<\/a>)))#i', '<a href="Ipsum" title="Ipsum" style="color: inherit;">\0</a>', $text);
echo $text;

"Lorem" is replaced, but "Ipsum" is not (only its last occurrence is).

The output of the PHP code above:

 <a href="Lorem" title="Lorem" style="color: inherit;">Lorem</a> Ipsum is simply dummy text of the printing and typesetting industry. <a href="Lorem" title="Lorem" style="color: inherit;">Lorem</a> Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing <a href="Lorem" title="Lorem" style="color: inherit;">Lorem</a> Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of <a href="Lorem" title="Lorem" style="color: inherit;">Lorem</a> <a href="Ipsum" title="Ipsum" style="color: inherit;">Ipsum</a>. 

Why doesn't "Ipsum" change?

Edited:

If you comment out the first preg_replace line, the (formerly second) preg_replace works just fine: PHP Fiddle 1 (hit F9 to run).

Also, if you swap the two preg_replace calls, "Ipsum" gets replaced but "Lorem" does not: PHP Fiddle 2.

So, if the two words are not initially inside anchor tags <a>, you don't need the lookbehind and lookahead conditions, or at least not in the second preg_replace. Otherwise the lookahead (?!(.*("|<\/a>))) rejects every "Ipsum" that is followed, anywhere later on the same line, by a " or </a> inserted by the first replacement, which leaves only the last occurrence: PHP Fiddle 3 ( 1 )
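The effect of the lookahead can be reproduced in isolation. The following is a minimal sketch, assuming a shortened single-line stand-in for the question's $text and trimmed replacement strings; the variable names $afterLorem, $countWith and $countWithout are only illustrative.

```php
<?php
// Shortened stand-in for the question's $text (assumption: one line, no markup).
$text = 'Lorem Ipsum is dummy text. Lorem Ipsum again.';

// First pass: the text contains no " or </a> yet, so every "Lorem" matches.
$afterLorem = preg_replace(
    '#(?<!((alt|src)="))Lorem(?!(.*("|<\/a>)))#i',
    '<a href="Lorem" title="Lorem">\0</a>',
    $text
);

// Second pass with the same lookahead: .* scans to the end of the line, so any
// "Ipsum" that has a " or </a> somewhere after it is rejected.
preg_replace(
    '#(?<!((alt|src)="))Ipsum(?!(.*("|<\/a>)))#i',
    '<a href="Ipsum" title="Ipsum">\0</a>',
    $afterLorem,
    -1,
    $countWith
);

// Same pass without the lookahead: every "Ipsum" is matched.
preg_replace(
    '#(?<!((alt|src)="))Ipsum#i',
    '<a href="Ipsum" title="Ipsum">\0</a>',
    $afterLorem,
    -1,
    $countWithout
);

echo $countWith, ' vs ', $countWithout, PHP_EOL; // prints "1 vs 2"
```

Only the final "Ipsum" has no quote or closing tag after it, which matches the output shown in the question.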


Update:

As the OP mentioned in a comment, the above runs into a problem if the string $text already contains <a> tags whose text includes the same words, for example:

 <a href="">test Lorem test</a>

In this case, regex alone won't do it IMHO. Instead, we need to do the following:

  1. Check for any occurrence of anchor tags <a> in the string $text .
  2. Use an array, $tempArr , as temporary storage for the link elements.
  3. Replace every link element with a placeholder that has a different form and carries a number as a unique ID: tempRep#0 , tempRep#1 , etc., one in place of every link element.
  4. Run the regex statement(s) ( 2 ) .
  5. Reverse step #3: replace tempRep#0 , tempRep#1 , etc. with the corresponding link elements stored in $tempArr , matching the number in each unique ID to the array index ( 3 ) .

The above algorithm could be done in JavaScript, because it needs some Document Object Model inspection. But since the OP said JavaScript is not an option, we use the PHP DOM extension instead: load the string $text as HTML and use getElementsByTagName() , getAttribute() and textContent (or alternatively, nodeValue ).
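To see those three DOM calls on their own before they are combined with the placeholder steps, here is a small sketch. The $html fragment and the $found array are made up for illustration; only DOMDocument, loadHTML(), getElementsByTagName(), getAttribute() and textContent come from PHP's DOM extension.

```php
<?php
// Hypothetical fragment containing one link, similar to the ones in $text.
$html = '<p>See <a href="link1href" title="test1">test Ipsum Lorem test</a> here.</p>';

$dom = new DOMDocument();
$dom->loadHTML($html); // parses the fragment and wraps it in <html><body>

$found = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $found[] = [
        'href'  => $link->getAttribute('href'),  // attribute value as a string
        'title' => $link->getAttribute('title'),
        'text'  => $link->textContent,           // text inside the link
    ];
}

print_r($found);
```

Note that getAttribute() returns an empty string for a missing attribute, so a link without a title would be rebuilt with title="" and would no longer match its original markup.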

So finally, we have the following:

PHP Fiddle 4 [ Final ]

$text = 'Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of <a href="link1href" title="test1">test Ipsum Lorem test</a> Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of <a href="link2href" title="test2">test Lorem test</a> Lorem Ipsum.';

$dom = new DOMDocument;
$dom->loadHTML($text);
$tempArr = array();
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {  
    $href = $link->getAttribute('href');
    $title = $link->getAttribute('title');
    $textCont = $link->textContent; // Alternatively, $link->nodeValue could be used too
    $linkElement = '<a href="' . $href . '" title="' . $title . '">' . $textCont . '</a>';
    $tempArr[] = $linkElement;
}

for($i=0; $i < count($tempArr); $i++){
    $text = str_replace($tempArr[$i], 'tempRep#' . $i, $text);
}

$text = preg_replace('#(?<!(alt|src)=")(Lorem|Ipsum)(?!(("|<\/a>)))#i', '<a href="\0" title="\0" style="color: inherit;">\0</a>', $text);

// Restore the links, highest ID first, so that tempRep#1 cannot match inside tempRep#10
for($i = count($tempArr) - 1; $i >= 0; $i--){
    $text = str_replace('tempRep#' . $i, $tempArr[$i], $text);
}
echo $text;

-----------------------------

Notes:

  1. I found that only the lookahead condition in the second preg_replace call causes the bug. In PHP Fiddle 5 I kept the lookbehind and removed only the lookahead, and it works fine. This is expected: the lookbehind only inspects the few characters immediately before the match, whereas the lookahead's .* scans the rest of the line.
  2. I've merged the two regex statements into one:

     $text = preg_replace('#(?<!(alt|src)=")(Lorem|Ipsum)(?!(("|<\/a>)))#i', '<a href="\0" title="\0" style="color: inherit;">\0</a>', $text);
  3. This is why we used a unique ID for each replacement.
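One caveat goes with note 3: the IDs are unique, but tempRep#1 is also a prefix of tempRep#10 , so the order of restoration matters once there are more than ten links. The sketch below uses a hypothetical stash of eleven links to compare ascending and descending restoration; a placeholder with a closing delimiter (for example tempRep#1# ) would be another way around it.

```php
<?php
// Hypothetical stash of eleven links, indexed 0..10.
$tempArr = [];
for ($i = 0; $i <= 10; $i++) {
    $tempArr[] = '<a href="link' . $i . '">link ' . $i . '</a>';
}

// Text in which links 1 and 10 have already been swapped for placeholders.
$text = 'first tempRep#1, last tempRep#10';

// Ascending order: 'tempRep#1' also matches the start of 'tempRep#10',
// so link 1 is restored twice and a stray "0" is left behind.
$ascending = $text;
for ($i = 0; $i < count($tempArr); $i++) {
    $ascending = str_replace('tempRep#' . $i, $tempArr[$i], $ascending);
}

// Descending order: the longer ID is restored before its prefix can match.
$descending = $text;
for ($i = count($tempArr) - 1; $i >= 0; $i--) {
    $descending = str_replace('tempRep#' . $i, $tempArr[$i], $descending);
}

echo $ascending, PHP_EOL;
echo $descending, PHP_EOL;
```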


 