How can I get this with XPath

I'm writing a crawler for a site and came across this problem.

From this HTML...

<div class="Price">
    <span style="font-size: 14px; text-decoration: line-through; color: #444;">195.90 USD</span>
    <br />
    131.90 USD           
</div>

I need to get only 131.90 USD using XPath.

Tried this...

"//div[@class='Price']"

But it returns a different result.

How can I achieve this?

EDIT

I'm using this C# code (simplified for demonstration)

protected override DealDictionary GrabData(HtmlAgilityPack.HtmlDocument html) {
    var price = Helper.GetInnerHtml(html.DocumentNode, "//div[@class='Price']/text()");
    // remainder omitted in this simplified example
}

Helper Class

public static class Helper {
    public static String GetInnerText(HtmlDocument doc, String xpath) {
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null && nodes.Count > 0) {
            var node = nodes[0];
            return node.InnerText.TrimHtml();
        }
        return String.Empty;
    }

    public static String GetInnerText(HtmlNode inputNode, String xpath) {
        var nodes = inputNode.SelectNodes(xpath);
        if (nodes != null && nodes.Count > 0) {
            var node = nodes[0];
            var comments = node.ChildNodes.OfType<HtmlCommentNode>().ToList();
            foreach (var comment in comments)
                comment.ParentNode.RemoveChild(comment);

            return node.InnerText.TrimHtml();
        }
        return String.Empty;
    }

    public static String GetInnerHtml(HtmlDocument doc, String xpath) {
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null && nodes.Count > 0) {
            var node = nodes[0];
            return node.InnerHtml.TrimHtml();
        }
        return String.Empty;
    }

    public static string GetInnerHtml(HtmlNode inputNode, string xpath) {
        var nodes = inputNode.SelectNodes(xpath);
        if (nodes != null && nodes.Count > 0) {
            var node = nodes[0];
            return node.InnerHtml.TrimHtml();
        }
        return string.Empty;
    }
}

The XPath you tried is a good start:

//div[@class='Price']

This selects every <div> element in the document whose class attribute has exactly the value Price .

So far, so good. But because you select a <div> element, what you get back is the <div> element including all of its contents: the <span> with the old price, the <br> and the new price.

In the HTML fragment you show above, you have the following hierarchical structure:

<div> element
    <span> element
        text node
    <br> element
    text node

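So what you are actually interested in is the last text node. You can use text() in XPath to select text nodes. In this case you want a text node that is an immediate child of the <div> element you found (the struck-through price is a child of the <span>, not of the <div>), so your XPath should look like this:

//div[@class='Price']/text()

Note that the outline above leaves out whitespace-only text nodes. A parser that preserves them, as HtmlAgilityPack does as far as I know, also returns the line breaks and indentation around the <span> and <br> as text nodes of the <div>, so nodes[0] in your helper may be blank. Adding a predicate keeps only the text nodes with visible content:

//div[@class='Price']/text()[normalize-space()]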
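
As a minimal sketch of how this could be tried with HtmlAgilityPack, the program below loads the fragment from the question and selects the direct text child of the <div>. The [normalize-space()] predicate is an addition here, assuming the parser keeps whitespace-only text nodes that would otherwise come first in the result. The class name PriceDemo and the inlined HTML string are only for illustration and are not part of the original code.

```csharp
using System;
using HtmlAgilityPack;

public static class PriceDemo
{
    public static void Main()
    {
        // The fragment from the question, inlined for illustration.
        const string html = @"<div class=""Price"">
    <span style=""font-size: 14px; text-decoration: line-through; color: #444;"">195.90 USD</span>
    <br />
    131.90 USD
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // text() matches only direct text children of the <div>, so the
        // struck-through price inside the <span> is not selected.
        // [normalize-space()] skips text nodes that contain only whitespace.
        var node = doc.DocumentNode.SelectSingleNode(
            "//div[@class='Price']/text()[normalize-space()]");

        // SelectSingleNode returns null when nothing matches.
        var price = node == null ? string.Empty : node.InnerText.Trim();
        Console.WriteLine(price);
    }
}
```

In the Helper class from the question, the same XPath string can be passed to GetInnerText or GetInnerHtml unchanged, since both take the first node of the result.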
