DTD parsing with StAX

I want to parse XML files that declare an HTML 4.01 doctype.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
[...]
</html>

I am using StAX with an XMLResolver to load a local copy of the DTD:

XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
xmlInputFactory.setXMLResolver(new LocalXmlResolver());
xmlOutputFactory = XMLOutputFactory.newInstance();
xmlOutputFactory.createXMLEventWriter(...)


private static final Map<String, String> DTDS = new HashMap<String, String>(){{
    // XHTML 1.0 DTDs
    put("-//W3C//DTD XHTML 1.0 Strict//EN", "xhtml1-strict.dtd");
    put("-//W3C//DTD XHTML 1.0 Transitional//EN", "xhtml1-transitional.dtd");
    put("-//W3C//DTD XHTML 1.0 Frameset//EN", "xhtml1-frameset.dtd");

    put("-//W3C//DTD HTML 4.01//EN", "strict.dtd");
    put("-//W3C//DTD HTML 4.01 Transitional//EN", "loose.dtd");
    put("-//W3C//DTD HTML 4.01 Frameset//EN", "frameset.dtd");
}};

private static final class LocalXmlResolver implements XMLResolver {

    @Override
    public Object resolveEntity(String publicID, String systemID, String baseURI, String namespace) throws XMLStreamException {
        // Look up the file name before building the path; concatenating first
        // would yield a path ending in "null" for unknown public IDs.
        String fileName = DTDS.get(publicID);
        if (StringUtils.isBlank(fileName)) {
            // Unknown DTD: returning null lets the parser resolve it normally.
            return null;
        }
        return getClass().getClassLoader().getResourceAsStream(XHTML_DTD_PATH + fileName);
    }
}

I retrieved the DTD from the W3C web site, but I had to edit the file to remove all comments inside declarations, for example changing the first declaration below into the second:

 <!ENTITY % ContentType "CDATA"
    -- media type, as per [RFC2045]
    --> 

 <!ENTITY % ContentType "CDATA">

But even after these modifications, I still get this error:

javax.xml.stream.XMLStreamException: ParseError at [row,col]:[184,11]
Message: The element type is required in the element type declaration.
    [...]
Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[184,11]
Message: The element type is required in the element type declaration.
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:598)
    at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(XMLEventReaderImpl.java:83)

Line 184 of the DTD file is:

<!ELEMENT (%fontstyle;|%phrase;) - - (%inline;)* >

Any ideas?

Thanks

HTML is an SGML language, so it has an SGML DTD. You can find more information about SGML here: http://validator.w3.org/docs/sgml.html

SGML is a bit different from XML, so it's no wonder that an XML parser cannot parse it.

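The main example is:

Comments inside entity declarations (delimited with double hyphens: --this is a comment--) are allowed in an SGML DTD but not in an XML DTD. The declaration on your line 184 runs into two more differences: SGML lets one element declaration name a whole group of elements and carries tag omission indicators (the "- -"), whereas an XML element declaration takes exactly one element name and has no such indicators.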

For more differences, see http://www.w3.org/TR/NOTE-sgml-xml-971215#null
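To see these differences with the parser itself, here is a small sketch that feeds SGML-style and XML-style declarations to the JDK's StAX parser through an internal DTD subset, so no external file is involved. The class and method names (`SgmlVersusXmlDtd`, `parses`, `withInternalSubset`) and the element names `a` and `c` are made up for illustration, and the XML-style element declaration is only a simplified stand-in, not a faithful translation of the HTML 4.01 one.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class SgmlVersusXmlDtd {

    /** Returns true if the StAX parser reads the whole document without an error. */
    static boolean parses(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next(); // the internal subset is scanned when the DOCTYPE is reached
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            return false;
        }
    }

    /** Wraps a single DTD declaration in an internal subset of a tiny document. */
    static String withInternalSubset(String declaration) {
        return "<!DOCTYPE a [ " + declaration + " ]><a/>";
    }

    public static void main(String[] args) {
        String[] declarations = {
            // SGML style: comment inside the entity declaration
            "<!ENTITY % ContentType \"CDATA\" -- media type -->",
            // XML style: same entity, comment removed
            "<!ENTITY % ContentType \"CDATA\">",
            // SGML style: name group plus tag omission indicators
            "<!ELEMENT (a|b) - - (c)* >",
            // XML style: exactly one element name, no indicators
            "<!ELEMENT a (c)*>"
        };
        for (String declaration : declarations) {
            System.out.println(parses(withInternalSubset(declaration)) + "  " + declaration);
        }
    }
}
```

Stripping the comments therefore only fixes one class of problem; the element declarations would have to be rewritten as well before an XML parser accepts the file.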

Nevertheless, you can disable DTD parsing for a specific DTD by creating your own XMLResolver:

XMLInputFactory xmlInput = XMLInputFactory.newInstance();
xmlInput.setXMLResolver(new XMLResolver() {
    @Override
    public Object resolveEntity(String publicID, String systemID, String baseURI, String namespace) throws XMLStreamException {
        ...
        // Skip this DTD: hand the parser an empty stream instead of the real file
        // (IOUtils comes from Apache Commons IO)
        if ("The public id you expect".equals(publicID)) {
            return IOUtils.toInputStream("");
        }
        ...
    }
});
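For reference, here is a self-contained sketch of the same idea that needs only the JDK: it returns an empty `ByteArrayInputStream` in place of `IOUtils.toInputStream("")` and then walks the document with an `XMLStreamReader`. The class name `SkipHtmlDtdDemo`, the `SAMPLE` document and the `elementNames` helper are invented for this illustration, and it assumes the StAX implementation consults the resolver for the external DTD subset, as the one in your stack trace does.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class SkipHtmlDtdDemo {

    static final String HTML_401_STRICT = "-//W3C//DTD HTML 4.01//EN";

    // Hypothetical input: well-formed XML that declares the HTML 4.01 doctype.
    static final String SAMPLE =
        "<!DOCTYPE HTML PUBLIC \"" + HTML_401_STRICT + "\" "
        + "\"http://www.w3.org/TR/html4/strict.dtd\">"
        + "<html><body><p>hello</p></body></html>";

    /** Parses the document and collects the names of all start elements. */
    static List<String> elementNames(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setXMLResolver((publicID, systemID, baseURI, namespace) -> {
            if (HTML_401_STRICT.equals(publicID)) {
                // Empty stream: the parser sees an empty external subset
                // and never touches the SGML DTD.
                return new ByteArrayInputStream(new byte[0]);
            }
            return null; // anything else is resolved normally
        });

        List<String> names = new ArrayList<>();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                names.add(reader.getLocalName());
            }
        }
        reader.close();
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(elementNames(SAMPLE));
    }
}
```

One caveat: with the DTD skipped, entities it would have declared (such as `&nbsp;`) are undefined, so documents that use them will still fail to parse.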

If you need an actual HTML parser, consider http://jtidy.sourceforge.net/ or http://jsoup.org/ instead.
