Java HTML XPath selector

I am trying to find a Java library like C#'s HtmlAgilityPack that can parse HTML and select elements using XPath.

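I have looked at many libraries, but none of them is a standalone XPath selector for HTML; every one I found requires parsing the HTML through something like HtmlUnit.

I would appreciate it if someone could guide me with a simple example of XPath 2.0 or 3.0 combined with HTML parsing.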

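Java's built-in XPath support is normally used for parsing XML files. However, it should also work for HTML, as long as the markup is well-formed XML (every tag closed, attributes quoted), because the JDK parser is an XML parser.

Sample HTML: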

<html lang="en">
<head>
    <title>Index page</title>
</head>
<body>
<div>
    <br/>
    <h1>Hello <span id="my-demo">User!</span></h1>
    <br/>
    <img src="https://s3.amazonaws.com/acloudguru-opsworkslab/ACG_Austin.JPG" alt="photo"/>
</div>
</body>
</html>
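The sample above closes every tag, including `<br/>` and `<img .../>`, which is what lets an XML parser read it. Real-world HTML often does not, so here is a minimal sketch for checking whether a piece of markup is acceptable to the JDK parser before running XPath on it. The class name `WellFormedCheck` and the method `isWellFormed` are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the original answer.
final class WellFormedCheck {

    // Returns true if the JDK XML parser accepts the markup, false if it reports a parse error.
    public static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed XML, e.g. an unclosed <br>
        } catch (Exception e) {
            throw new IllegalStateException("Parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<html><body><br/></body></html>")); // true
        System.out.println(isWellFormed("<html><body><br></body></html>"));  // false
    }
}
```

If the check fails, the markup would first have to be cleaned up into well-formed XML by some HTML-aware parser, which is outside what the JDK offers.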

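Code snippet:

import java.io.File;
import java.io.IOException;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class HtmlXpathParser {
    private final DocumentBuilder builder;
    private final XPath path;

    public HtmlXpathParser() throws ParserConfigurationException {
        // JDK XML parser: the input must be well-formed XML
        DocumentBuilderFactory dbfactory = DocumentBuilderFactory.newInstance();
        builder = dbfactory.newDocumentBuilder();
        XPathFactory xpfactory = XPathFactory.newInstance();
        path = xpfactory.newXPath();
    }

    public Optional<String> parse(String fileName) throws SAXException, IOException, XPathExpressionException {
        File file = new File(fileName);

        Document doc = builder.parse(file);
        // Returns the string value of the first match, or "" if nothing matches
        String result = path.evaluate("//img/@src", doc);

        // Treat "no match" as an empty Optional instead of an empty string
        return Optional.of(result).filter(s -> !s.isEmpty());
    }

    public static void main(String[] args) throws ParserConfigurationException, XPathExpressionException, SAXException, IOException {
        HtmlXpathParser parser = new HtmlXpathParser();

        Optional<String> srcResult = parser.parse("src/main/resources/index.html");
        srcResult.ifPresent(System.out::println);
    }
}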

Output:

https://s3.amazonaws.com/acloudguru-opsworkslab/ACG_Austin.JPG
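The snippet above returns only the string value of the first match. When an expression can match several elements, one option is to ask for a node set instead. The following is a sketch under a few assumptions: the markup is held in a string rather than a file, and the class name `HtmlXpathSelectAll`, the method `selectAll`, and the two-image sample are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original answer.
final class HtmlXpathSelectAll {

    // Made-up sample; it has to be well-formed XML for the JDK parser.
    public static final String SAMPLE_HTML =
        "<html><body><div>"
        + "<h1>Hello <span id=\"my-demo\">User!</span></h1>"
        + "<img src=\"a.jpg\" alt=\"first\"/>"
        + "<img src=\"b.jpg\" alt=\"second\"/>"
        + "</div></body></html>";

    // Evaluates the expression as a node set and returns each node's text content in document order.
    public static List<String> selectAll(String html, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(html)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(selectAll(SAMPLE_HTML, "//img/@src"));            // [a.jpg, b.jpg]
        System.out.println(selectAll(SAMPLE_HTML, "//span[@id='my-demo']")); // [User!]
    }
}
```

For attribute nodes such as `@src`, `getTextContent()` yields the attribute value; for elements it yields the concatenated text of the element and its descendants.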

This works with XPath 1.0. If you need a later version, you can use something like xpath2-parser.
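Before reaching for another library, it may be worth checking whether XPath 1.0 functions already cover the need. The sketch below uses only the JDK and shows `normalize-space`, `count`, `contains`, and `translate` (the usual 1.0 stand-in for case conversion, since `upper-case()` and `lower-case()` only arrived in XPath 2.0). The class name `Xpath1Functions`, its helper methods, and the shortened sample markup are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original answer.
final class Xpath1Functions {

    // Made-up, well-formed sample loosely based on the page in the answer.
    public static final String SAMPLE_HTML =
        "<html><body><div><br/>"
        + "<h1>Hello   <span id=\"my-demo\">User!</span></h1><br/>"
        + "<img src=\"photo.JPG\" alt=\"photo\"/></div></body></html>";

    // Parses the markup and evaluates the expression with the requested XPath 1.0 result type.
    private static Object eval(String html, String expr, QName type) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(html)));
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc, type);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expr, e);
        }
    }

    public static String asString(String html, String expr) {
        return (String) eval(html, expr, XPathConstants.STRING);
    }

    public static double asNumber(String html, String expr) {
        return (Double) eval(html, expr, XPathConstants.NUMBER);
    }

    public static boolean asBoolean(String html, String expr) {
        return (Boolean) eval(html, expr, XPathConstants.BOOLEAN);
    }

    public static void main(String[] args) {
        // Collapses the run of spaces inside <h1> into one.
        System.out.println(asString(SAMPLE_HTML, "normalize-space(//h1)"));
        // Counts the <br/> elements.
        System.out.println(asNumber(SAMPLE_HTML, "count(//br)"));
        // Substring test on an attribute value.
        System.out.println(asBoolean(SAMPLE_HTML, "contains(//img/@src, 'photo')"));
        // Upper-cases ASCII letters by mapping each lowercase letter to its uppercase form.
        System.out.println(asString(SAMPLE_HTML,
            "translate(//span, 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')"));
    }
}
```

Features that have no 1.0 equivalent, such as regular expressions or sequences, are where an XPath 2.0 or 3.0 implementation becomes necessary.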
