I'm creating a little web app to help me manage and analyze the content of my websites, and cURL is my favorite new toy. I've figured out how to extract info about all sorts of elements, how to find all elements with a certain class, etc., but I am stuck on two problems (see below). I hope there is some nifty xpath answer, but if I have to resort to regular expressions I guess that's ok. Although I'm not so great with regex so if you think that's the way to go, I'd appreciate examples...
Pretty standard starting point:
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
$info .= "<br />cURL error number:" .curl_errno($ch);
$info .= "<br />cURL error:" . curl_error($ch);
return $info;
}
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
and extraction of info, for example:
// iframes
$iframes = $xpath->evaluate("/html/body//iframe");
$info .= '<h3>iframes ('.$iframes->length.'):</h3>';
for ($i = 0; $i < $iframes->length; $i++) {
// get iframe attributes
$iframe = $iframes->item($i);
$framesrc = $iframe->getAttribute("src");
$framewidth = $iframe->getAttribute("width");
$frameheight = $iframe->getAttribute("height");
$framealt = $iframe->getAttribute("alt");
$frameclass = $iframe->getAttribute("class");
$info .= $framesrc.' ('.$framewidth.'x'.$frameheight.'; class="'.$frameclass.'")'.'<br />';
}
Questions/Problems:
How to extract HTML comments?
I can't figure out how to identify the comments – are they considered nodes, or something else entirely?
How to get the entire content of a div, including child nodes? So if the div contains an image and a couple of hrefs, it would find those and hand it all back to me as a block of HTML.
Comment nodes should be easy to find in XPath with the comment()
test, analogous to the text()
test:
$comments = $xpath->query('//comment()'); // or another path, as you prefer
They are standard nodes: here is the manual entry for the DOMComment
class .
To your other question, it's a bit trickier. The simplest way is to use saveXML()
with its optional $node
argument:
$html = $dom->saveXML($el); // $el should be the element you want to get
// the HTML for
For the HTML comments a fast method is:
function getComments ($html) {
$rcomments = array();
$comments = array();
if (preg_match_all('#<\!--(.*?)-->#is', $html, $rcomments)) {
foreach ($rcomments as $c) {
$comments[] = $c[1];
}
return $comments;
} else {
// No comments matchs
return null;
}
}
That Regex \\s*<!--[\\s\\S]+?-->
Helps to you.
public function parse($source) {
$comments = array();
// multiline comment /* */
$tmp = explode("/*", $source);
foreach ($tmp as $t) {
if (strpos($t, "*/") !== false) {
$comment = explode("*/", $t)[0];
$comment = trim($comment);
if (!empty($comment)) $comments[] = "/* " . $comment . " */";
}
}
// multiline comment <!-- -->
$tmp = explode("<!--", $source);
foreach ($tmp as $t) {
if (strpos($t, "-->") !== false) {
$comment = explode("-->", $t)[0];
$comment = trim($comment);
if (!empty($comment)) $comments[] = "<!-- " . $comment . " -->";
}
}
$tmp = explode("//", $source);
foreach ($tmp as $t) {
if (empty($t)) continue;
$pos = strpos($source, $t);
if ($pos > 1) {
if ($source[$pos-2] == "/" && $source[$pos-1] == "/") {
$comment = trim(explode("\n", $t)[0]);
if (!empty($comment)) $comments[] = "// " . $comment;
}
}
}
for comments your looking for recursive regex. For instance, to get rid of html comments:
preg_replace('/<!--(?(?=<!--)(?R)|.)*?-->/s',$yourHTML);
to find them:
preg_match_all('/(<!--(?(?=<!--)(?R)|.)*?-->)/s',$yourHTML,$comments);
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.