Get DOMXPath results below previous result in HTML

I'm trying to sort through the HTML of an external website and, unfortunately, the site is very poorly organized. The data might look something like this:

<a class="title">Title One</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>

<a class="title">Title Two</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>    

And I'm working with an XPath query like this for the titles:

$titles = $x->evaluate('//a[@class="title"]');

Now, I want to list the titles with the items below them. Unfortunately, none of these elements are conveniently wrapped in a parent div, so I can't just filter through everything in the parent. So, I use a query like this for the items:

$items = $x->evaluate('//a[@class="item"]');

Ideally, what I'd like to do is check ONLY for results below the current title element. So, if I'm looping through and hit "Title One", I want to check only the "item" results that appear between Title One and Title Two. Is there any way to do this?

Modifying the HTML is not an option here. I know this question is a little ridiculous and my explanation might be horrible, but if there's a solution, it would really help me!

Thanks everyone.

You can find the title elements first and then use the ->nextSibling property to move forward:

$html =<<<EOM
<a class="title">Title One</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>

<a class="title">Title Two</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>
EOM;

$d = new DOMDocument;
$d->loadHTML($html);
$x = new DOMXPath($d);
foreach ($x->query('//a[@class="title"]') as $node) {
    echo "Title: {$node->nodeValue}\n";
    // iterate the siblings
    while ($node = $node->nextSibling) {
        if ($node->nodeType != XML_ELEMENT_NODE) {
            continue; // skip text nodes
        }
        if ($node->getAttribute('class') != 'item') {
            // no more .item
            break;
        }
        echo "Item: {$node->nodeValue}\n";
    }
}

Output:

Title: Title One
Item: Item One
Item: Item Two
Title: Title Two
Item: Item One
Item: Item Two
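
The same sibling walk can be wrapped in a function that returns the groups instead of echoing them. This is only a minimal sketch: the function name groupItemsByTitle and the shape of the returned array are assumptions made for illustration, not part of the answer above.

```php
<?php
// Hypothetical helper: pairs each title with the "item" anchors that
// directly follow it, using the nextSibling walk shown above.
function groupItemsByTitle(string $html): array
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $groups = [];
    foreach ($xpath->query('//a[@class="title"]') as $title) {
        $items = [];
        for ($node = $title->nextSibling; $node !== null; $node = $node->nextSibling) {
            if ($node->nodeType !== XML_ELEMENT_NODE) {
                continue; // skip text nodes between the anchors
            }
            if ($node->getAttribute('class') !== 'item') {
                break; // reached the next title or some other element
            }
            $items[] = $node->nodeValue;
        }
        $groups[] = ['title' => $title->nodeValue, 'items' => $items];
    }
    return $groups;
}

$html = <<<EOM
<a class="title">Title One</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>

<a class="title">Title Two</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>
EOM;

$groups = groupItemsByTitle($html);
print_r($groups);
```

Returning a list of title/items pairs rather than an array keyed by title text keeps two titles with identical text from overwriting each other.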

You want to select all following siblings of the <a> element with class="title" that are themselves <a> elements with class="item", and whose first preceding sibling <a> element with class="title" is the exact title element you started from.

For example, in XPath, suppose you are looking for the first title element:

//a[@class="title"][1]

For that element, the item elements are as follows:

//a[@class="title"][1]
    /following-sibling::a[
      @class="item" 
      and preceding-sibling::a[@class="title"][1] 
          = //a[@class="title"][1]
    ]
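
As a quick check, the expression can be run unchanged through DOMXPath. This is a sketch under assumptions: the sample markup is the question's, except that the second group's items are renamed Item Three and Item Four so the two groups are easy to tell apart, and the variable names are illustrative.

```php
<?php
// Sample markup: second group's items renamed (an assumption for clarity).
$html = <<<EOM
<a class="title">Title One</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>

<a class="title">Title Two</a>
<a class="item">Item Three</a>
<a class="item">Item Four</a>
EOM;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// The expression from above, written on one line.
$expr = '//a[@class="title"][1]'
      . '/following-sibling::a[@class="item"'
      . ' and preceding-sibling::a[@class="title"][1] = //a[@class="title"][1]]';

$firstGroup = [];
foreach ($xpath->query($expr) as $item) {
    $firstGroup[] = $item->nodeValue;
}

echo implode(', ', $firstGroup), "\n";
```

Because preceding-sibling is a reverse axis, the [1] inside the predicate picks the nearest title before each item, which is what limits the result to the first group.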

If you want to make use of that in code, you can do so by creating an expression relative to the title element and using DOMNode::getNodePath():

// $xp is a DOMXPath instance for the loaded document
$titles = $xp->query('//a[@class="title"]');
foreach ($titles as $title)
{
    echo $title->nodeValue, ":\n";
    $query = './following-sibling::a[@class="item" and 
              preceding-sibling::a[@class="title"][1] = ' .
              $title->getNodePath() . ']';
    foreach ($xp->query($query, $title) as $item)
    {
        echo ' * ', $item->nodeValue, "\n";
    }    
}

Output, assuming the items under Title Two are named Item Three and Item Four:

Title One:
 * Item One
 * Item Two
Title Two:
 * Item Three
 * Item Four
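
One caveat: in XPath 1.0, = between two node-sets compares string values rather than node identity, so two titles with identical text would be treated as the same title. A possible variant, sketched here as an alternative and not as the answer's method, matches items by counting the title siblings that precede them. The function name itemsByTitlePosition and the sample markup are assumptions, and the sketch also assumes all titles and items share one parent.

```php
<?php
// Hypothetical helper: groups items by the position of their title,
// so titles with identical text are still kept apart.
function itemsByTitlePosition(DOMXPath $xpath): array
{
    $groups = [];
    foreach ($xpath->query('//a[@class="title"]') as $title) {
        // 1-based position of this title among its title siblings.
        $position = (int) $xpath->evaluate(
            'count(preceding-sibling::a[@class="title"])',
            $title
        ) + 1;

        // An item belongs to this title when exactly $position titles precede it.
        $query = sprintf(
            'following-sibling::a[@class="item"]'
            . '[count(preceding-sibling::a[@class="title"]) = %d]',
            $position
        );

        $items = [];
        foreach ($xpath->query($query, $title) as $item) {
            $items[] = $item->nodeValue;
        }
        $groups[] = ['title' => $title->nodeValue, 'items' => $items];
    }
    return $groups;
}

// Two titles with the same text (illustrative data).
$html = <<<EOM
<a class="title">News</a>
<a class="item">Item A</a>
<a class="item">Item B</a>
<a class="title">News</a>
<a class="item">Item C</a>
<a class="item">Item D</a>
EOM;

$doc = new DOMDocument();
$doc->loadHTML($html);
$result = itemsByTitlePosition(new DOMXPath($doc));
print_r($result);
```

With the string comparison, the first "News" title here would also pick up Item C and Item D; counting preceding titles avoids that at the cost of one extra evaluate() call per title.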
