
Xpath query not returning all children?

I am attempting to parse data from a remote web page. I retrieve it using curl and then load it into a DOMDocument with loadHTML().

I need to select elements by class name. I tried several solutions, but the most widely accepted method seems to be an XPath query, which I run as follows:

$classname = 'content-container-primary character-list';
$xpath = new DOMXPath($dom);
$results = $xpath->query("//*[@class='" . $classname . "']");
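
One caveat with the query above: `@class='...'` is an exact string comparison, so it stops matching if the page adds a class, reorders the classes, or changes the whitespace. A common workaround is to test for a single class token with `contains()` and `normalize-space()`. The following is a minimal sketch, assuming a small hypothetical HTML fragment in place of the remote page:

```php
<?php
// Hypothetical sample markup standing in for the remote page.
$html = '<div class="content-container-primary character-list"><ul class="list-group"></ul></div>'
      . '<div class="character-list  extra"></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Exact comparison: matches only when the whole attribute value is identical.
$exact = $xpath->query("//*[@class='content-container-primary character-list']");

// Token comparison: matches any element whose class list contains "character-list",
// regardless of the other classes, their order, or extra whitespace.
$token = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' character-list ')]"
);

echo $exact->length, "\n"; // 1
echo $token->length, "\n"; // 2
```

Whether the looser match is desirable depends on how unique the class token is on the real page.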

This finds the container on line 134 of the linked pastebin text, which resembles the following (abbreviated for clarity):

<div class="content-container-primary character-list">
    <ul class="list-group media-list media-list-stream m-t-0">
        <li class="list-group-item p-a">
            <span class="header-buttons"></span>
            <span class="header-text"></span>
        </li>
        <li class="list-group-item p-a visible-xs">
            <div class="row sw-ad-mrec mobile"></div>
        </li>
        <li class="media list-group-item p-a collection-char-list">
            <div class="row">
                <div class="col-xs-6 col-sm-3 col-md-3 col-lg-2">
                </div>
            </div>
        </li>
    </ul>
</div>

I could search for the UL directly using $classname = 'list-group media-list media-list-stream m-t-0';, but since the page uses class names rather than IDs, I'm concerned that the UL's class name is less specific and might be reused elsewhere. Finding this UL as a child of the DIV with class 'character-list' gives me some confidence that I'm looking at the right object.

What I want to find is all of the "col-xs-6 col-sm-3 col-md-3 col-lg-2" class items. I'm not sure how this name is generated programmatically and I don't want to rely on it, so I'm not searching for it directly with XPath. Instead, I thought to treat these as children of the <li class="media list-group-item p-a collection-char-list"> container, which is itself a child of the <ul class="list-group media-list media-list-stream m-t-0"> container, which in turn is a child of my $results->item(0). So far so good: when I check the object, the UL child has been found correctly. Once found, I copy it into a temporary DOMDocument so I can do more XPath searching.

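$temp_dom = new DOMDocument();
foreach ($results->item(0)->childNodes as $n) {
    // Only element nodes have attributes; skip text and comment nodes.
    if ($n->nodeType === XML_ELEMENT_NODE) {
        if ($n->getAttribute("class") == 'list-group media-list media-list-stream m-t-0') {
            // Deep-copy the UL (with all descendants) into the temporary document.
            $temp_dom->appendChild($temp_dom->importNode($n, true));
        }
    }
}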
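
As an aside, copying nodes into a second document is not strictly necessary: `DOMXPath::query()` accepts a context node as its second argument, so a relative expression can be evaluated from the container that was already found. The sketch below assumes a trimmed, hypothetical version of the markup shown earlier, and uses a class-token test for the LI, which is a substitution for the exact class match used in the question:

```php
<?php
// Hypothetical, trimmed copy of the structure shown in the question.
$html = <<<HTML
<div class="content-container-primary character-list">
    <ul class="list-group media-list media-list-stream m-t-0">
        <li class="list-group-item p-a"></li>
        <li class="media list-group-item p-a collection-char-list">
            <div class="row"><div class="col-xs-6 col-sm-3 col-md-3 col-lg-2"></div></div>
        </li>
    </ul>
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Step 1: locate the outer container, as in the question.
$container = $xpath->query("//*[@class='content-container-primary character-list']")->item(0);

// Step 2: run a relative query (note the leading ".") with the container as context node.
$items = $xpath->query(
    "./ul/li[contains(concat(' ', normalize-space(@class), ' '), ' collection-char-list ')]",
    $container
);

foreach ($items as $li) {
    echo $li->getAttribute('class'), "\n";
}
```

Keeping everything in the original document also means the nodes stay as the parser built them, which makes it easier to compare results against a search on the full $dom.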

So far so good. Now I have confidence I have the UL I want, but when I attempt:

$classname = 'media list-group-item p-a collection-char-list';
$xpathUL = new DOMXPath( $temp_dom );
$result2 = $xpathUL->query("//*[@class='" . $classname . "']");

I get an empty result, DOMNodeList Object ( [length] => 0 ).

Investigating, I tried:

$nodes = array();
$nodes = $temp_dom->getElementsByTagName( 'li' );

foreach ( $nodes as $node ) {
    echo $node->getAttribute( "class" ).'<br>';
}

This shows the first two LI elements but not the third.

If I go back to the original $dom object and run the XPath search using the class name of the LI I'm interested in, it does appear in the results.

I'm uncertain where I've got off track. I know this solution feels pretty clunky but it performs well enough for my purposes.

Please note I have no control over the format of the remote page and have to work with the format as is.

A complete sample is available at this pastebin link.

I've noticed there's a div with class "row" in the middle of the structure.

You could try a more advanced xpath query string:

//*[@class='media list-group-item p-a collection-char-list']/*[@class='row']/*[@class='col-xs-6 col-sm-3 col-md-3 col-lg-2']
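
For illustration, here is how that expression might be run against the original document. This is only a sketch: the HTML is a small hypothetical fragment modelled on the question's markup, not the real page.

```php
<?php
// Hypothetical fragment: one matching LI containing a row with two column DIVs.
$html = <<<HTML
<ul class="list-group media-list media-list-stream m-t-0">
    <li class="list-group-item p-a"></li>
    <li class="media list-group-item p-a collection-char-list">
        <div class="row">
            <div class="col-xs-6 col-sm-3 col-md-3 col-lg-2">A</div>
            <div class="col-xs-6 col-sm-3 col-md-3 col-lg-2">B</div>
        </div>
    </li>
</ul>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Each "/" step descends exactly one level: LI -> DIV.row -> column DIV.
$query = "//*[@class='media list-group-item p-a collection-char-list']"
       . "/*[@class='row']"
       . "/*[@class='col-xs-6 col-sm-3 col-md-3 col-lg-2']";

$cols = $xpath->query($query);

foreach ($cols as $col) {
    echo trim($col->textContent), "\n";
}
```

Because the column DIVs are grandchildren of the LI rather than direct children, the intermediate `/*[@class='row']` step is what makes the path line up with the markup.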
