
XPath: Help in locating a specific element in a DOM scraped using HTMLUnit

I am scraping a webpage using HTMLUnit and have collected a List of DOM nodes from the page.

Inside each of these "company" DOM nodes is some data I want to scrape. For example, I want the telephone number text from the <span class="tel" itemprop="telephone"> element inside this node.

Now, this element would be a child of a div element which is in turn a child of another div element inside the company node. What would be the correct XPath expression to access it? Here is my latest attempt, which returned nothing.

List<DomNode> companies = (List<DomNode>) page.getByXPath("//li[@class='featured block twoblock    boxshadow']");
for (int j = 0; j < companies.size(); j++) {

    DomNode company = companies.get(j);

    // retrieve telephone number
    DomNode telephone = (DomNode) company.getByXPath(
            "//li[@data-pvd-p='"+j+1+"']/div[@class='listingWrapper']/div[@class='itemInfo']/span[@class='tel']").get(0);
}

Here is a sample of the HTML:

<li class="featured block twoblock boxshadow" data-pvd-p="3" data-pvd-c="0046176330000011028" data-pvd-et="sv" data-pvd-l="true">
    <div class="listingWrapper" itemtype="http://schema.org/LocalBusiness" itemscope="">
        <a href="/Craddock-Electrical-Services-Ltd/0046176330000011028/"></a>
        <div class="itemInfo">
            <div class="tradeImage" itemprop="member" itemscope="" itemtype="http://schema.org/Organization"></div>
            <h2>
                <a itemprop="name" href="/Craddock-Electrical-Services-Ltd/0046176330000011028/"></a>
            </h2>
            <span class="tel" itemprop="telephone"></span>
            <div class="listLinks"></div>
            <div id="addressBar"></div>
        </div>
        <div class="itemInfo2"></div>
        <div class="clearLeft"></div>
        <ul class="features"></ul>
        <div class="clearLeft"></div>
        <p class="promo" itemprop="description"></p>
    </div>
</li>

UPDATE 2:

Here is the current state of my XPath code.

List<DomNode> companies = (List<DomNode>) page
        .getByXPath("//li[contains(@class, 'featured block')]");
for (int j = 0; j < companies.size(); j++) {

    String url = "";
    DomNode company = companies.get(j);
    DomElement web = null;

    // retrieve name
    DomNode name = (DomNode) company.getByXPath("//a[@itemprop='name']").get(j);

    if (companiesLogged.contains(name.getTextContent().trim()) != true) {
        companiesLogged.add(name.getTextContent().trim());

        // retrieve telephone number
        DomNode telephone = (DomNode) company.getByXPath("div[@class='listingWrapper']/div[@class='itemInfo']/span[@class='tel']").get(0);

        // retrieve website
        try {
            web = (DomElement) company.getByXPath("div[@class='listingWrapper']/div[@class='itemInfo']" +
                    "/div[@class='listLinks']/a[@id='linkWebsite']").get(0);
        } catch (IndexOutOfBoundsException e) {
            System.out.print(" (No Website) ");
        }

        try {
            url = web.getAttribute("href");
        } catch (IndexOutOfBoundsException e) {
            url = "N/A";
        }

        System.out.println(name.getTextContent().trim() + "   "
                + telephone.getTextContent().trim()
                + "   " + url.trim());

    } else {
        System.out.println("Company already logged");
    }
}

The first thing I see is how you're retrieving the group of <li> nodes. Just looking at your @class attribute, you can't really tell how many spaces are in "featured block twoblock boxshadow", but that XPath will only return a result if the attribute value is exactly equal to it. In that regard, try using something more flexible like contains(), i.e. //li[contains(@class, 'featured block')].
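To make the difference between an exact @class comparison and contains() concrete, here is a minimal sketch. It uses the JDK's built-in javax.xml.xpath API instead of HtmlUnit so that it runs without extra dependencies; the markup, the class name ClassMatchDemo, and the count helper are all made up for illustration and are not part of the question's code.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ClassMatchDemo {
    // Hypothetical markup: the class values are separated by single spaces.
    static final String XML =
          "<ul>"
        + "<li class=\"featured block twoblock boxshadow\"/>"
        + "<li class=\"featured block oneblock\"/>"
        + "<li class=\"standard block\"/>"
        + "</ul>";

    // Counts the nodes that the given XPath expression selects in XML.
    static int count(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Exact comparison with extra spaces in the literal: no match.
        System.out.println(count("//li[@class='featured block twoblock    boxshadow']")); // 0
        // Exact comparison with the same spacing as the markup: one match.
        System.out.println(count("//li[@class='featured block twoblock boxshadow']"));    // 1
        // Substring test: matches every class value containing "featured block".
        System.out.println(count("//li[contains(@class, 'featured block')]"));            // 2
    }
}
```

Note that contains() is a plain substring test, so it would also match a class value such as "unfeatured blocks"; that may or may not matter for a given page.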

Without seeing the source you're targeting I can't suggest much more, but I will update the answer when it's added to the question.

I've tried your XPath (just the /div part, since that's what was provided) on the given snippet and got back <span class="tel" itemprop="telephone"/> as a result. It looks like the issue is with how you're retrieving the <li> company nodes.

Update 2: From your updated XML snippet, your first XPath, //li[@class='featured block twoblock boxshadow'], doesn't look like it will match the parent <li> node, based on what I mentioned about the spaces before. Secondly, even if it did, you are checking the <li> node's attributes twice in separate queries, and assuming that the data-pvd-p value (3 in the snippet) will always match the list index (which starts at 0, with your +1 added). Note also that "'"+j+1+"'" is evaluated left to right as string concatenation in Java, so j = 0 yields '01' rather than '1'. I'd suggest removing the //li[@data-pvd-p='"+j+1+"'] portion and beginning with the div step, as a path relative to each company node.

So something like this:

List<DomNode> companies = (List<DomNode>) page.getByXPath("//li[contains(@class, 'featured block')]");
for (DomNode node : companies) {

    // retrieve telephone number, relative to the current <li> node
    DomNode telephone = (DomNode) node.getByXPath(
            "div[@class='listingWrapper']/div[@class='itemInfo']/span[@class='tel']").get(0);
}
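The reason a path starting with div works where one starting with // does not is that, in XPath, a leading // always starts from the document root, even when the expression is evaluated against a context node. The sketch below demonstrates this with the JDK's javax.xml.xpath API rather than HtmlUnit, on the assumption that HtmlUnit's getByXPath follows the same XPath rule. The markup, the placeholder values tel-A and tel-B, and the names RelativeXPathDemo and firstTelPerCompany are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RelativeXPathDemo {
    // Builds one hypothetical company <li> with a placeholder telephone value.
    static String company(String tel) {
        return "<li class=\"featured block twoblock boxshadow\">"
             + "<div class=\"listingWrapper\"><div class=\"itemInfo\">"
             + "<span class=\"tel\">" + tel + "</span>"
             + "</div></div></li>";
    }

    static final String XML = "<ul>" + company("tel-A") + company("tel-B") + "</ul>";

    // For each company <li>, evaluates telExpression with that <li> as the
    // context node and collects the text of the first match (or "N/A").
    static List<String> firstTelPerCompany(String telExpression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList companies = (NodeList) xpath.evaluate(
                    "//li[contains(@class, 'featured block')]", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < companies.getLength(); i++) {
                Node li = companies.item(i);
                NodeList tels = (NodeList) xpath.evaluate(
                        telExpression, li, XPathConstants.NODESET);
                result.add(tels.getLength() > 0 ? tels.item(0).getTextContent() : "N/A");
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Leading "//" ignores the context node and searches the whole document,
        // so every company yields the first telephone on the page.
        System.out.println(firstTelPerCompany("//span[@class='tel']")); // [tel-A, tel-A]
        // A relative path is resolved against each <li> in turn.
        System.out.println(firstTelPerCompany(
                "div[@class='listingWrapper']/div[@class='itemInfo']/span[@class='tel']")); // [tel-A, tel-B]
    }
}
```

This would also explain why the name lookup in the question's updated code needs get(j): //a[@itemprop='name'] returns every name on the page, whereas a relative form such as .//a[@itemprop='name'] with get(0) would stay inside the current company node.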
