Unmarshalling XML with HTML entities using JAXB

Question

I need to load Wikipedia revision histories into POJOs, so I'm using JAXB to unmarshal the Wikipedia data dump (well, individual pages of it). The problem is that the text nodes occasionally contain entities that are not defined in the Wikipedia XML dump, e.g. &deg; (the degree sign, °). Please keep in mind that I do not know the complete set of entities that I need to be able to read. My input file is 3 TB, so let's just assume that everything HTML can render is in there.

How can I configure JAXB to handle entities that are not valid XML?

Here is the SAX exception that JAXB throws when it encounters an undefined entity:

Exception in thread "main" javax.xml.bind.UnmarshalException
 - with linked exception:
[org.xml.sax.SAXParseException: The entity "deg" was referenced, but not declared.]
    at javax.xml.bind.helpers.AbstractUnmarshallerImpl.createUnmarshalException(AbstractUnmarshallerImpl.java:315)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.createUnmarshalException(UnmarshallerImpl.java:481)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:199)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:168)
    at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:137)
    at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:184)
    at com.stottlerhenke.tools.wikiparse.WikipediaIO.readPage(WikipediaIO.java:73)
    at com.stottlerhenke.tools.wikiparse.WikipediaIO.main(WikipediaIO.java:53)
Caused by: org.xml.sax.SAXParseException: The entity "deg" was referenced, but not declared.
    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:195)

Edit: The input that triggered that exception is the complete revision history for the Wikipedia article on the Arctic Circle. The XSD used to generate the JAXB classes is here: http://www.mediawiki.org/xml/export-0.3.xsd

Edit: The source of this problem was an error on my part: I was using an initial extractor that did not maintain encoded entities properly. However, I did find a way around this, should anyone have the problem I thought I had. See below.

Answer 1

Resolving entities is not JAXB's job. It's the job of the underlying XML parser.

What you could do is:

read the data yourself using DOM
replace all unresolved entities with something of your choosing
then, let JAXB handle the result
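
The "replace unresolved entities" step could be sketched roughly as below. Note that this swaps the DOM step for a plain text pass over the raw XML, on the assumption that a DOM parser would most likely stop at the same undeclared-entity error before any replacement could happen. The class name `EntityPreprocessor`, the method `replaceHtmlEntities`, and the four-entry entity table are all hypothetical; a real table would need to cover the full HTML entity set.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rewrites known HTML named entities as numeric
// character references, which any XML parser accepts without a DTD.
class EntityPreprocessor {

    // Illustrative subset only (assumption); extend with the full HTML entity list.
    private static final Map<String, Integer> HTML_ENTITIES = Map.of(
            "deg", 0x00B0,
            "nbsp", 0x00A0,
            "ndash", 0x2013,
            "mdash", 0x2014);

    // Matches named references such as &deg; but not numeric ones such as &#176;
    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String replaceHtmlEntities(String xml) {
        Matcher m = ENTITY.matcher(xml);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer codePoint = HTML_ENTITIES.get(m.group(1));
            // Names not in the table (including XML's own &amp;, &lt;, ...) are left untouched.
            String replacement = (codePoint == null)
                    ? m.group()
                    : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The cleaned string can then be wrapped in a `StringReader` and passed to the unmarshaller. For a multi-terabyte dump the same replacement would have to be applied page by page or through a streaming filter rather than on one in-memory string.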

Answer 2

This is a hack, but it works in a pinch.

I downloaded the HTML entity definitions from w3.org and set the doctype of the input XML file to xhtml-transitional, but pointed the doctype URL at a local DTD:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "xhtml1-transitional.dtd">

xhtml1-transitional.dtd, in turn, requires:

xhtml-lat1.ent
xhtml-special.ent
xhtml-symbol.ent

which I downloaded and put alongside xhtml1-transitional.dtd.

(All files are available at: http://www.w3.org/TR/xhtml1/DTD/ )

Like I said, ugly as hell, but it did seem to do the job.
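
A variation on the same idea, for anyone who would rather not depend on where the input file sits relative to the DTD: a SAX `EntityResolver` can redirect the DTD and .ent lookups to a local directory. This is only a sketch, not what I actually ran. The class name `LocalDtdResolver`, the `newReader` helper, and the directory argument are made up for illustration.

```java
import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical resolver: serves any requested .dtd/.ent file from a local
// directory, matching on the file name at the end of the system identifier.
class LocalDtdResolver implements EntityResolver {

    private final File dtdDir;

    LocalDtdResolver(File dtdDir) {
        this.dtdDir = dtdDir;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (systemId == null) {
            return null;
        }
        String name = systemId.substring(systemId.lastIndexOf('/') + 1);
        File local = new File(dtdDir, name);
        if ((name.endsWith(".dtd") || name.endsWith(".ent")) && local.isFile()) {
            InputSource source = new InputSource(local.toURI().toString());
            source.setPublicId(publicId);
            return source;
        }
        return null; // anything else falls back to the parser's default lookup
    }

    // Builds a namespace-aware SAX reader that uses the local copies.
    static XMLReader newReader(File dtdDir) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setEntityResolver(new LocalDtdResolver(dtdDir));
        return reader;
    }
}
```

The reader can be wrapped in a `javax.xml.transform.sax.SAXSource` together with the input and handed to `Unmarshaller.unmarshal(Source)`, so JAXB parses through it. The input still needs the DOCTYPE line shown above, since the parser only loads the entity definitions the doctype points to.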
