
Get DOMXpath results below previous result in HTML

I'm trying to sort through the HTML of an external website and, unfortunately, the site is very poorly organized. The data might look something like this:

<a class="title">Title One</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>

<a class="title">Title Two</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>    

And I'm working with an xpath query like this for the titles:

$titles = $x->evaluate('//a[@class="title"]');

Now, I want to list the titles with the items below them. Unfortunately, none of these elements are conveniently wrapped in a parent div, so I can't just filter through everything in the parent. So, I use a query like this for the items:

$items = $x->evaluate('//a[@class="item"]');

Ideally, what I'd like to do is ONLY check for results below the current title element. So, if I'm looping through and hit "title one", I want to only check the "item" results that appear between title one and title two. Is there any way to do this?

Modifying the HTML is not an option here. I know this question is a little ridiculous and my explanation might be horrible, but if there's a solution, it would really help me!

Thanks everyone.

You can find the title elements first and then use the nextSibling property to move forward:

$html =<<<EOM
<a class="title">Title One</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>

<a class="title">Title Two</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>
EOM;

$d = new DOMDocument;
$d->loadHTML($html);
$x = new DOMXPath($d);
foreach ($x->query('//a[@class="title"]') as $node) {
    echo "Title: {$node->nodeValue}\n";
    // iterate the siblings
    while ($node = $node->nextSibling) {
        if ($node->nodeType != XML_ELEMENT_NODE) {
            continue; // skip text nodes
        }
        if ($node->getAttribute('class') != 'item') {
            // no more .item
            break;
        }
        echo "Item: {$node->nodeValue}\n";
    }
}

Output:

Title: Title One
Item: Item One
Item: Item Two
Title: Title Two
Item: Item One
Item: Item Two
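If you need the grouped data rather than printed lines, the same sibling walk can collect results into an array. This is only a sketch: the function name groupItemsByTitle and the shape of the returned array are assumptions for illustration, and it relies on every title and its items sharing one parent, as in the sample markup.

```php
<?php
// Hypothetical helper (not part of the original answer): walks the siblings
// after each .title link and collects the .item links that follow it.
function groupItemsByTitle(string $html): array
{
    $doc = new DOMDocument;
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $groups = [];
    foreach ($xpath->query('//a[@class="title"]') as $title) {
        $items = [];
        // Use a separate cursor so the loop variable $title is not overwritten.
        for ($n = $title->nextSibling; $n !== null; $n = $n->nextSibling) {
            if ($n->nodeType !== XML_ELEMENT_NODE) {
                continue; // skip text nodes such as whitespace
            }
            if ($n->getAttribute('class') !== 'item') {
                break; // reached the next title or an unrelated element
            }
            $items[] = $n->nodeValue;
        }
        $groups[] = ['title' => $title->nodeValue, 'items' => $items];
    }
    return $groups;
}

$html = <<<EOM
<a class="title">Title One</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>

<a class="title">Title Two</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>
EOM;

foreach (groupItemsByTitle($html) as $group) {
    echo $group['title'], ': ', implode(', ', $group['items']), "\n";
}
```

Returning a list of title/items pairs instead of keying by title text keeps groups apart even when two titles share the same text.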

You want to select all following siblings of the title element (<a class="title">) that are <a> elements with class="item" and whose first preceding sibling <a> element with class="title" is the very element you started from.

In XPath, for example, the first title element is:

//a[@class="title"][1]

For that element, the item elements are as follows:

//a[@class="title"][1]
    /following-sibling::a[
      @class="item" 
      and preceding-sibling::a[@class="title"][1] 
          = //a[@class="title"][1]
    ]

If you want to use that in code, you can build an expression relative to the title element and identify that element with DOMNode::getNodePath():

$titles = $xp->query('//a[@class="title"]');
foreach ($titles as $title)
{
    echo $title->nodeValue, ":\n";
    $query = './following-sibling::a[@class="item" and 
              preceding-sibling::a[@class="title"][1] = ' .
              $title->getNodePath() . ']';
    foreach ($xp->query($query, $title) as $item)
    {
        echo ' * ', $item->nodeValue, "\n";
    }    
}

Output (assuming the items under Title Two are named Item Three and Item Four in the input):

Title One:
 * Item One
 * Item Two
Title Two:
 * Item Three
 * Item Four
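One caveat: in XPath 1.0 the = operator compares node-sets by string value, not by node identity, so two titles with identical text would each match the other's items. A possible variant, sketched here under the assumption that all titles and items are siblings under one parent, selects items by counting the title siblings that precede them. The variable names and the sample markup are illustrative only.

```php
<?php
// Sample input with two titles that have the same text on purpose.
$html = <<<EOM
<a class="title">Same</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>

<a class="title">Same</a>
<a class="item">Item Three</a>
EOM;

$doc = new DOMDocument;
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

$groups = [];
foreach ($xp->query('//a[@class="title"]') as $title) {
    // Position of this title among its title siblings (1-based).
    $n = (int) $xp->evaluate('count(preceding-sibling::a[@class="title"])', $title) + 1;

    // Items after this title that have exactly $n titles before them,
    // i.e. items located before the next title.
    $query = 'following-sibling::a[@class="item"]'
           . '[count(preceding-sibling::a[@class="title"]) = ' . $n . ']';

    $items = [];
    foreach ($xp->query($query, $title) as $item) {
        $items[] = $item->nodeValue;
    }
    $groups[] = ['title' => $title->nodeValue, 'items' => $items];
}

foreach ($groups as $group) {
    echo $group['title'], ":\n";
    foreach ($group['items'] as $item) {
        echo ' * ', $item, "\n";
    }
}
```

With this input the first "Same" lists Item One and Item Two and the second lists only Item Three, even though both titles have the same string value.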

