Parsing an HTML file into a DOM tree to extract nodes (Java)
I am trying to parse an HTML file into a DOM tree and extract nodes via an XPath expression.
I can successfully parse the HTML into the DOM tree; however, when I try to extract nodes via XPath, I get nothing back.
Please note that this is only the relevant code snippet.
import java.util.List;
import org.cyberneko.html.parsers.DOMParser;
import org.dom4j.Document;
import org.dom4j.Node;
import org.dom4j.io.DOMReader;
import org.xml.sax.InputSource;
DOMParser parser = new DOMParser();
parser.parse(new InputSource("file:///Z:/homepage.htm"));
org.w3c.dom.Document doc = parser.getDocument();
DOMReader reader = new DOMReader();
Document document = reader.read(doc);
@SuppressWarnings("unchecked")
List<Node> nodes = document.selectNodes("//HEAD/LINK");
Result: nodes.size() is 0.
For completeness, here is a snippet of the HTML:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<HTML xmlns="http://www.w3.org/1999/xhtml">
<HEAD>
<META content="text/html;charset=UTF-8" http-equiv="Content-Type"/>
<TITLE/>
<LINK
href="wcm/groups/visual/documents/webasset/####_ie_5_css.css"
media="all" rel="stylesheet" type="text/css"/>
<LINK
href="wcm/groups/visual/documents/webasset/####_ie_5_5000_css.css"
media="all" rel="stylesheet" type="text/css"/>
<LINK
href="wcm/groups/visual/documents/webasset/####_ie_6_css.css"
media="all" rel="stylesheet" type="text/css"/>
Many thanks as always,
Joe
I suspect this is namespace-related.
document.selectNodes("//HEAD/LINK");
needs to be namespace-aware, e.g.
document.selectNodes("//*[local-name()='HEAD']/*[local-name()='LINK']");
XPath 2.0 also permits a namespace wildcard in the name test, as shown next, but dom4j evaluates XPath 1.0 (via Jaxen), so that form is not available here:
//*:HEAD/*:LINK
@BrianAgnew is right, your problem is namespace-related.
The problem lies here:
<HTML xmlns="http://www.w3.org/1999/xhtml">
Since the document declares a default namespace, xmlns="http://www.w3.org/1999/xhtml", your XPath expression //HEAD/LINK will not work: both the HEAD and LINK elements belong to that default namespace, while an unprefixed name in an XPath 1.0 expression only matches elements in no namespace.
@BrianAgnew suggested using:
document.selectNodes("//*[local-name()='HEAD']/*[local-name()='LINK']");
For more info on why local-name() works, see "XPATHS and Default Namespaces" and the answer on the same thread.
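To see the effect in isolation, here is a minimal sketch that runs without dom4j or NekoHTML. It is an assumption-laden stand-in, not the code from the question: it uses the JDK's built-in javax.xml.xpath engine (also XPath 1.0), and the class name LocalNameDemo, the count helper, and the trimmed-down document with its a.css/b.css hrefs are all made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LocalNameDemo {
    // Hypothetical, trimmed-down stand-in for the document in the question.
    static final String XHTML =
        "<HTML xmlns=\"http://www.w3.org/1999/xhtml\"><HEAD><TITLE/>"
        + "<LINK href=\"a.css\"/><LINK href=\"b.css\"/></HEAD></HTML>";

    // Parses XHTML with namespace awareness on and returns how many nodes
    // the given XPath 1.0 expression selects.
    static int count(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XHTML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Unprefixed names only match elements in no namespace: prints 0.
        System.out.println(count("//HEAD/LINK"));
        // local-name() ignores the namespace entirely: prints 2.
        System.out.println(count("//*[local-name()='HEAD']/*[local-name()='LINK']"));
    }
}
```

The same two expressions passed to dom4j's selectNodes should behave the same way, since both engines follow XPath 1.0 name-matching rules.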
There is another way of selecting these nodes without using local-name(): create an alias (prefix) for the default namespace and then use it in your XPath expression, e.g.
Map<String, String> namespaceUris = new HashMap<String, String>();
namespaceUris.put("foobar", "http://www.w3.org/1999/xhtml");
XPath xPath = DocumentHelper.createXPath("//foobar:HEAD/foobar:LINK");
xPath.setNamespaceURIs(namespaceUris);
@SuppressWarnings("unchecked")
List<Node> selectNodes = xPath.selectNodes(document);
Above we bind the alias foobar to the same URI (http://www.w3.org/1999/xhtml) as the default namespace. This allows an XPath expression such as
//foobar:HEAD/foobar:LINK
to work. Of course, you can use whatever alias you like.
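The alias idea is not specific to dom4j. As a rough sketch, assuming only the JDK's javax.xml.xpath API (not the dom4j calls used in this answer), the same binding can be expressed with a NamespaceContext. The class name PrefixAliasDemo, the count helper, and the small inline document are hypothetical and exist only for this illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PrefixAliasDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical, trimmed-down stand-in for the document in the question.
    static final String XHTML =
        "<HTML xmlns=\"" + XHTML_NS + "\"><HEAD><TITLE/>"
        + "<LINK href=\"a.css\"/><LINK href=\"b.css\"/></HEAD></HTML>";

    // Binds `alias` to the XHTML namespace URI, then counts the nodes that
    // `expression` selects in the sample document.
    static int count(final String alias, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XHTML)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return alias.equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? alias : null;
                }

                @Override
                public Iterator<String> getPrefixes(String uri) {
                    String prefix = getPrefix(uri);
                    return prefix == null
                        ? Collections.<String>emptyIterator()
                        : Collections.singleton(prefix).iterator();
                }
            });

            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // The alias is arbitrary; only the URI it maps to matters.
        System.out.println(count("foobar", "//foobar:HEAD/foobar:LINK"));
        System.out.println(count("x", "//x:HEAD/x:LINK"));
    }
}
```

Both calls select the same two LINK elements, which illustrates the point above: the prefix in the expression is just a local name for the URI and need not match anything in the document itself.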
Here's a sample app that uses both approaches. It's a bit rough, but it should give you the right idea:
package org.foo.bar.foobar;

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.cyberneko.html.parsers.DOMParser;
import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Node;
import org.dom4j.XPath;
import org.dom4j.io.DOMReader;
import org.dom4j.io.XMLWriter;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class App
{
    public static void main( String[] args ) throws SAXException, IOException
    {
        // Parse the HTML with NekoHTML, then convert the W3C DOM into a dom4j Document.
        DOMParser parser = new DOMParser();
        parser.parse(new InputSource("file:///Z:/homepage.htm"));
        org.w3c.dom.Document doc = parser.getDocument();
        DOMReader reader = new DOMReader();
        Document document = reader.read(doc);

        // Dump the parsed document to stdout for inspection.
        XMLWriter xmlWriter = new XMLWriter(System.out);
        xmlWriter.write(document);

        // Approach 1: ignore namespaces by matching on local-name().
        @SuppressWarnings("unchecked")
        List<Node> nodes = document.selectNodes("//*[local-name()='HEAD']/*[local-name()='LINK']");
        System.out.println("Number of Nodes: " + nodes.size());

        // Approach 2: bind an alias to the default namespace URI and use it as a prefix.
        Map<String, String> namespaceUris = new HashMap<String, String>();
        namespaceUris.put("foobar", "http://www.w3.org/1999/xhtml");
        XPath xPath = DocumentHelper.createXPath("//foobar:HEAD/foobar:LINK");
        xPath.setNamespaceURIs(namespaceUris);
        @SuppressWarnings("unchecked")
        List<Node> selectNodes = xPath.selectNodes(document);
        System.out.println("Number of nodes: " + selectNodes.size());
    }
}
Here's the pom I used, for good measure:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.foo.bar</groupId>
<artifactId>foobar</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>foobar</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>dom4j</groupId>
<artifactId>dom4j</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>jaxen</groupId>
<artifactId>jaxen</artifactId>
<version>1.1.1</version>
</dependency>
<dependency>
<groupId>nekohtml</groupId>
<artifactId>nekohtml</artifactId>
<version>1.9.6.2</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>
Also see One Fork, "How To use Dom4J XPath with XML Namespaces", which covers a situation very similar to the one you encountered.