I am attempting to parse data from a remote web page. I retrieve it with curl and then load it into a DOMDocument using loadHTML().
I need to select elements by class name. I tried several solutions, but the most widely accepted method seems to be an XPath query, which I run as:
$classname = 'content-container-primary character-list';
$xpath = new DOMXPath($dom);
$results = $xpath->query("//*[@class='" . $classname . "']");
This finds the container on line 134 of the pastebin linked text which resembles (abbreviated for clarity):
<div class="content-container-primary character-list">
    <ul class="list-group media-list media-list-stream m-t-0">
        <li class="list-group-item p-a">
            <span class="header-buttons"></span>
            <span class="header-text"></span>
        </li>
        <li class="list-group-item p-a visible-xs">
            <div class="row sw-ad-mrec mobile"></div>
        </li>
        <li class="media list-group-item p-a collection-char-list">
            <div class="row">
                <div class="col-xs-6 col-sm-3 col-md-3 col-lg-2">
                </div>
            </div>
        </li>
    </ul>
</div>
I could search for the UL directly using $classname = 'list-group media-list media-list-stream m-t-0'; but since the page uses class names rather than IDs, I'm concerned that the UL's class name is less specific and might be reused elsewhere. Finding this UL as a child of the DIV with class 'character-list' gives me some confidence that I'm looking at the right object.
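A minimal sketch of that idea, assuming made-up sample markup (the $html string below is hypothetical, not the real page). Instead of comparing the whole class attribute as one string, the predicate tests for a single class token, so it should still match if the site reorders or adds classes. It then takes the UL only as a direct child of the 'character-list' DIV.

```php
<?php
// Hypothetical sample markup, not the real page.
$html = '<div class="character-list content-container-primary">'
      . '<ul class="list-group m-t-0"><li>wanted</li></ul>'
      . '</div>'
      . '<div class="character-list-footer">'
      . '<ul class="list-group m-t-0"><li>other</li></ul>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact comparison of the whole attribute: sensitive to class order and spacing.
$exact = $xpath->query("//*[@class='content-container-primary character-list']");

// Token test: pad the normalized class list with spaces, then look for
// " character-list " so that "character-list-footer" does not match.
$uls = $xpath->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' character-list ')]/ul"
);

echo $exact->length, "\n";
echo $uls->length, "\n";
echo $uls->item(0)->textContent, "\n";
```

The exact query finds nothing here because the sample lists the two classes in the opposite order, while the token test still finds the one UL inside the 'character-list' DIV.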
What I want to find is all of the "col-xs-6 col-sm-3 col-md-3 col-lg-2" class items. I'm not sure how this name is generated and I don't want to rely on it, so I avoided an XPath search for it. Instead I thought to treat these items as children of the <li class="media list-group-item p-a collection-char-list"> container, which is itself a child of the <ul class="list-group media-list media-list-stream m-t-0"> container, which in turn is a child of my $results->item(0). So far so good: if I then check the object, I find the UL child has been correctly found. Once found, I copy it into a temporary DOMDocument so I can do more XPath searching.
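$temp_dom = new DOMDocument();
foreach ($results->item(0)->childNodes as $n) {
    // Skip text and comment nodes; only element nodes carry a class attribute.
    if ($n->nodeType === XML_ELEMENT_NODE) {
        // Import only the UL with the expected class (deep copy, including descendants).
        if ($n->getAttribute("class") == 'list-group media-list media-list-stream m-t-0') {
            $temp_dom->appendChild($temp_dom->importNode($n, true));
        }
    }
}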
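For comparison, here is a sketch of the same narrowing step without a temporary document, assuming made-up markup modeled on the abbreviated structure above (the $html string is hypothetical). DOMXPath::query() accepts a context node as its second argument, and an expression starting with "." is evaluated relative to that node.

```php
<?php
// Hypothetical markup modeled on the abbreviated structure; not the real page.
$html = <<<'HTML'
<div class="content-container-primary character-list">
  <ul class="list-group media-list media-list-stream m-t-0">
    <li class="list-group-item p-a"></li>
    <li class="media list-group-item p-a collection-char-list">
      <div class="row"><div class="col-xs-6 col-sm-3 col-md-3 col-lg-2"></div></div>
    </li>
  </ul>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Find the outer container first, as in the question.
$container = $xpath
    ->query("//*[@class='content-container-primary character-list']")
    ->item(0);

// The leading "." makes the path relative to $container, so only LI
// elements inside this container's direct UL child are considered.
$items = $xpath->query(
    "./ul/li[contains(concat(' ', normalize-space(@class), ' '), ' collection-char-list ')]",
    $container
);

echo $items->length, "\n";
```

Because the nodes stay in the original document, nothing has to be imported or copied before running further queries.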
So far so good. Now I have confidence I have the UL I want, but when I attempt:
$classname = 'media list-group-item p-a collection-char-list';
$xpathUL = new DOMXPath( $temp_dom );
$result2 = $xpathUL->query("//*[@class='" . $classname . "']");
I get an empty result, DOMNodeList Object ( [length] => 0 ).
Investigating, I tried:
$nodes = array();
$nodes = $temp_dom->getElementsByTagName( 'li' );
foreach ( $nodes as $node ) {
    echo $node->getAttribute( "class" ).'<br>';
}
This shows the first two LI elements but not the third.
If I go back to the original $dom object and run the XPath search using the class name of the LI I'm interested in, it does appear in the results.
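One way to see where the parser actually placed each LI is DOMNode::getNodePath(), which returns an XPath location for a node. The sketch below uses made-up markup (the $html string and its class names are hypothetical) in which the third LI deliberately sits in a second UL. It only illustrates the diagnostic and is not a claim about how the real page is structured.

```php
<?php
// Hypothetical markup: the third <li> lives in a second <ul>.
$html = '<div class="character-list">'
      . '<ul><li class="first"></li><li class="second"></li></ul>'
      . '<ul><li class="third"></li></ul>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Record the class and the parsed location of every LI in the document.
$paths = [];
foreach ($xpath->query('//li') as $li) {
    $paths[] = $li->getAttribute('class') . ' => ' . $li->getNodePath();
}

echo implode("\n", $paths), "\n";
```

If an LI reports a parent other than the UL that was imported, that would explain why it is missing from the copy while still being found in the original $dom.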
I'm uncertain where I've got off track. I know this solution feels pretty clunky but it performs well enough for my purposes.
Please note I have no control over the format of the remote page and have to work with the format as is.
The complete sample is available at this pastebin link.
I noticed there's a div with class "row" in the middle of the structure, between the li and the col items.
You could try a longer XPath query that walks through it:
//*[@class='media list-group-item p-a collection-char-list']/*[@class='row']/*[@class='col-xs-6 col-sm-3 col-md-3 col-lg-2']
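Here is a short sketch of that query in use, assuming sample markup based on the abbreviated structure in the question (the $html string and the "first"/"second" text are made up for illustration). It also shows a variant that selects every child of the "row" div without naming the col classes.

```php
<?php
// Hypothetical markup based on the abbreviated structure; text content is made up.
$html = <<<'HTML'
<div class="content-container-primary character-list">
<ul class="list-group media-list media-list-stream m-t-0">
<li class="list-group-item p-a"><span class="header-text"></span></li>
<li class="media list-group-item p-a collection-char-list">
<div class="row">
<div class="col-xs-6 col-sm-3 col-md-3 col-lg-2">first</div>
<div class="col-xs-6 col-sm-3 col-md-3 col-lg-2">second</div>
</div>
</li>
</ul>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Walk li -> div.row -> col items, one step per level.
$cols = $xpath->query(
    "//*[@class='media list-group-item p-a collection-char-list']"
    . "/*[@class='row']"
    . "/*[@class='col-xs-6 col-sm-3 col-md-3 col-lg-2']"
);

// Variant that does not depend on the col class names at all.
$anyChild = $xpath->query(
    "//*[@class='media list-group-item p-a collection-char-list']/*[@class='row']/*"
);

$texts = [];
foreach ($cols as $col) {
    $texts[] = trim($col->textContent);
}

echo implode(', ', $texts), "\n";
echo $anyChild->length, "\n";
```

The second query may suit the question better, since it avoids relying on the generated col class names.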