
Replace HTML using DOMDocument in PHP

I'm trying to clean up some bad HTML using DOMDocument. The HTML has a <div class="article"> element that uses <br/><br/> instead of </p><p>. I want to regex these into paragraphs, but I can't seem to get my node back into the original document:

//load entire doc
$doc = new DOMDocument();
$doc->loadHTML($htm);
$xpath = new DOMXpath($doc);
//get the article
$article = $xpath->query("//div[@class='article']")->parentNode;
//get as string
$article_htm =   $doc->saveXML($article);
//regex the bad markup
$article_htm2 = preg_replace('/<br\/><br\/>/i', '</p><p>', $article_htm);

//create new doc w/ new html string
$doc2 = new DOMDocument();
$doc2->loadHTML($article_htm2);
$xpath2 = new DOMXpath($doc2);

//get the original article node
$article_old = $xpath->query("//div[@class='article']");
//get the new article node
$article_new = $xpath2->query("//div[@class='article']");

//replace original node with new node
$article->replaceChild($article_old, $article_new);
$article_htm_new = $doc->saveXML();

//dump string
var_dump($article_htm_new);

All I get is a 500 Internal Server Error. I'm not sure what I'm doing wrong.

There are several issues:

  1. $xpath->query() returns a DOMNodeList, not a node. You must select an item from the list, e.g. with item(0).
  2. replaceChild() expects the new node as its 1st argument and the node to replace as its 2nd.
  3. $article_new belongs to another document, so you must first import it into $doc with importNode().
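The first point is easy to check in isolation. The snippet below is a minimal sketch using made-up sample markup (the HTML string is an assumption, not the asker's data). It shows that query() returns a list and that item(0) returns either a node or null:

```php
<?php
// Hypothetical sample markup, used only for illustration.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div class="article"><p>Text</p></div></body></html>');
$xpath = new DOMXPath($doc);

// query() returns a DOMNodeList, even when exactly one node matches
$nodes = $xpath->query("//div[@class='article']");
var_dump($nodes instanceof DOMNodeList);
var_dump($nodes->length);

// item(0) returns the first match, or null when nothing matched
$first = $nodes->item(0);
if ($first === null) {
    throw new RuntimeException('No div.article found');
}

// Only an individual node has a parentNode property
var_dump($first->parentNode->nodeName);
```

Checking for null before touching parentNode avoids a fatal error when the XPath expression matches nothing.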

Fixed code:

//load entire doc
$doc = new DOMDocument();
$doc->loadHTML($htm);
$xpath = new DOMXpath($doc);
//get the article
$article = $xpath->query("//div[@class='article']")->item(0)->parentNode;
//get as string
$article_htm =   $doc->saveXML($article);
//regex the bad markup
$article_htm2 = preg_replace('/<br\/><br\/>/i', '</p><p>', $article_htm);

//create new doc w/ new html string
$doc2 = new DOMDocument();
$doc2->loadHTML($article_htm2);
$xpath2 = new DOMXpath($doc2);

//get the original article node
$article_old = $xpath->query("//div[@class='article']")->item(0);
//get the new article node
$article_new = $xpath2->query("//div[@class='article']")->item(0);

//import the new node into $doc
$article_new=$doc->importNode($article_new,true);

//replace original node with new node
$article->replaceChild($article_new, $article_old);
$article_htm_new = $doc->saveHTML();

//dump string
var_dump($article_htm_new);

Instead of using two documents, you can create a DOMDocumentFragment from $article_htm2 and use that fragment as the replacement.
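A minimal sketch of the fragment approach follows. The sample $htm value is an assumption for illustration. Note that appendXML() requires well-formed XML, so this only works when the regex result is still balanced, as it is when the <br/><br/> pairs sit inside an existing <p>. If the markup is messier, the tolerant loadHTML() route with a second document is the safer choice.

```php
<?php
// Hypothetical sample input: paragraph breaks written as doubled <br/> tags.
$htm = '<html><body><div class="article"><p>First paragraph.<br/><br/>Second paragraph.</p></div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($htm);
$xpath = new DOMXPath($doc);

// The node that will be replaced
$article_old = $xpath->query("//div[@class='article']")->item(0);

// Serialize only the article, then fix the markup with the regex
$article_htm  = $doc->saveXML($article_old);
$article_htm2 = preg_replace('/<br\/><br\/>/i', '</p><p>', $article_htm);

// Build a fragment owned by $doc, so no importNode() call is needed.
// appendXML() returns false if the string is not well-formed XML.
$fragment = $doc->createDocumentFragment();
if (!$fragment->appendXML($article_htm2)) {
    throw new RuntimeException('Fixed markup is not well-formed XML');
}

// replaceChild() is called on the parent: new node first, old node second
$article_old->parentNode->replaceChild($fragment, $article_old);

$article_htm_new = $doc->saveHTML();
var_dump($article_htm_new);
```

Because the fragment is created by $doc itself, it already belongs to the target document, which removes the second DOMDocument and the import step.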

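replaceChild() has to be called on the parent of the node being replaced, because a node is not a child of itself. If $article holds the div itself rather than its parent, the call should be:

$article->parentNode->replaceChild($article_new, $article_old);

Note the argument order: the new node comes first, then the node to replace.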

