I have to correct several posts in a database that look like this:
<a href="somelink.html"><img src=someimage.jpg border=1 alt="some text"></a>
So I need to:
One thing I tried was to parse the DOM and read the src and alt attributes:
$doc = new DOMDocument();
// Clean the markup once; the original ran removeUnnecessaryTags() twice.
$body = $this->removeUnnecessaryTags($body);
$doc->loadHTML($body);
$imageTags = $doc->getElementsByTagName('img');
$result = [];
foreach ($imageTags as $tag) {
    $result[] = [ 'src' => $tag->getAttribute('src'), 'alt' => $tag->getAttribute('alt') ];
}
I know this can be done with regex but my regex knowledge is not very good. Any ideas?
Thanks
All you need is to use DOMDocument features and libxml options:
$html = '<a href="somelink.html"><img src=someimage.jpg border=1 alt="some text"></a>';
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$result = $dom->saveXML($dom->documentElement);
echo $result;
LIBXML_HTML_NODEFDTD prevents a default DTD from being added automatically when the document has none. LIBXML_HTML_NOIMPLIED prevents the html and body tags from being added when they are missing.
The saveXML method serializes the document with XML-compliant syntax, which solves the self-closing tag problem. Passing $dom->documentElement as the argument avoids the XML declaration that is otherwise added automatically. (*)
Whichever method you use (saveXML or saveHTML), attribute values are automatically enclosed in double quotes.
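To make the difference between the two serializers concrete, here is a minimal sketch that wraps the steps above in a function. The name normalizeFragment and its $asXml flag are hypothetical, not part of the original answer, and the sketch assumes a non-empty fragment with a single root element, like the sample post.

```php
<?php
// Hypothetical helper (not from the original answer): parse one HTML
// fragment that has a single root element and serialize it again.
// Assumes $html is non-empty.
function normalizeFragment(string $html, bool $asXml = true): string
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $dom->documentElement;

    // saveXML() writes empty elements as self-closing tags;
    // saveHTML() keeps the HTML form without the trailing slash.
    return $asXml ? $dom->saveXML($root) : $dom->saveHTML($root);
}

$post = '<a href="somelink.html"><img src=someimage.jpg border=1 alt="some text"></a>';

$asXml  = normalizeFragment($post, true);
$asHtml = normalizeFragment($post, false);

echo $asXml, "\n", $asHtml, "\n";
```

Both results should have quoted attribute values; only the saveXML version should close the img tag with a slash. Fragments with several top-level elements would need extra handling, which this sketch does not attempt.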
(*) This also removes any DTD the document has, so if you want to preserve it, you can use this workaround to strip only the XML declaration:
$result = $dom->saveXML();
$result = substr($result, strpos($result, "\n") + 1);
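As an illustration of that workaround, the following sketch runs it on a full document that carries its own DTD. The sample markup is an assumption made up for the example, and the check for a leading "<?xml" is an extra guard that the original two lines do not have.

```php
<?php
// Assumed sample input: a complete document with its own doctype.
$html = '<!DOCTYPE html><html><body><img src=someimage.jpg alt="some text"></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
// No libxml options here: the document already has a DTD, html and body.
$dom->loadHTML($html);
libxml_clear_errors();

// Serialize the whole document, DTD included.
$result = $dom->saveXML();

// Drop the first line only when it really is the XML declaration.
if (strncmp($result, '<?xml', 5) === 0) {
    $result = substr($result, strpos($result, "\n") + 1);
}

echo $result;
```

The guard is a design choice for safety: the substr call relies on the declaration being on the first line, so the sketch verifies that before cutting.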