Xerces DOM parser incredibly slow?

Question

Currently, I am trying to clean up an HTML file using JTidy, convert it to XHTML and pass the result to a DOM parser. The following code is the result of these efforts:

public class HeaderBasedNewsProvider implements INewsProvider {

    /* ... */

    public Collection<INewsEntry> getNewsEntries() throws NewsUnavailableException {
        Document document;
        try {
            document = getCleanedDocument();
        } catch (Exception e) {
            throw new NewsUnavailableException(e);
        }
        System.err.println(document.getDocumentElement().getTextContent());
        return null;
    }

    private final Document getCleanedDocument() throws IOException, SAXException, ParserConfigurationException {
        InputStream input = inputStreamProvider.getInputStream();
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        ByteArrayOutputStream tidyOutputStream = new ByteArrayOutputStream();
        tidy.parse(input, tidyOutputStream);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        InputStream domInputStream = new ByteArrayInputStream(tidyOutputStream.toByteArray());
        System.err.println(factory.getClass());
        return factory.newDocumentBuilder().parse(domInputStream);
    }
}

However, the DOM parser implementation (com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl) on my system seems to be incredibly slow. Even for one-line documents such as the following, parsing takes 2-3 minutes:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title></title></head><body><div class="text"><h2>Nachricht vom 16. Juni 2011</h2><h1>Titel</h1><p>Mitteilung <a href="dokumente/medienmitteilungen/MM_NR_jglp.pdf" target="_blank">weiter</a> mehr Mitteilung</p></div></body></html>

Note that, in contrast to the DOM parser, JTidy finishes its work within a second. Therefore, I suspect that I'm somehow misusing the DOM API.

Thanks in advance for any suggestions on this one!

Answer 1

Even when not validating, an XML parser needs to fetch the DTD, for example to support named character entities. You should look into implementing an EntityResolver that resolves the request for the DTD to a local copy.
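A minimal sketch of that idea, not a definitive implementation. It assumes the tidied document carries the XHTML 1.0 Transitional system ID, and it uses a tiny in-memory DTD (`LOCAL_DTD`, a hypothetical stand-in declaring only two entities) where a real setup would open a locally stored copy of the W3C DTD files. The class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class LocalDtdExample {

    // Hypothetical stand-in for a local DTD copy; a real one would be
    // read from disk or from the classpath instead.
    static final String LOCAL_DTD =
            "<!ENTITY nbsp \"&#160;\">\n"
          + "<!ENTITY uuml \"&#252;\">\n";

    // Answers requests for the W3C XHTML DTDs locally instead of over the network.
    static EntityResolver localResolver() {
        return (publicId, systemId) -> {
            if (systemId != null
                    && systemId.startsWith("http://www.w3.org/TR/xhtml1/DTD/")) {
                InputSource source = new InputSource(new StringReader(LOCAL_DTD));
                source.setPublicId(publicId);
                source.setSystemId(systemId);
                return source;
            }
            // Returning null tells the parser to fall back to its default lookup.
            return null;
        };
    }

    static Document parse(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(localResolver());
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
              + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
              + "<html><body><p>M&uuml;ller&nbsp;AG</p></body></html>";
        System.out.println(parse(xhtml).getDocumentElement().getTextContent());
    }
}
```

In the code from the question, this would amount to calling `setEntityResolver` on the builder returned by `factory.newDocumentBuilder()` before calling `parse`.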

Answer 2

HTML DTDs are huge and pull in further files through includes, so loading them takes forever. Use an XML catalog: it lets you store the DTDs locally and map them by their system ID.

If you use a build tool such as Maven, you will find sufficient pointers.

The advantage of intercepting the entity requests, as the accepted answer suggests, is that you still receive the correct characters.
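The catalog approach could be sketched as follows. This is only one possible setup: it swaps in the `javax.xml.catalog` API that ships with the JDK since Java 9 (the answer does not say which catalog implementation it has in mind), and it writes a throwaway catalog plus a stub DTD to a temporary directory so that the example is self-contained. The file names and the stub DTD content are assumptions. A real catalog would map the system IDs to full local copies of the W3C DTDs.

```java
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CatalogExample {

    // Writes a demo catalog and a stub DTD (both hypothetical) to a temp directory.
    static Path writeDemoCatalog() throws Exception {
        Path dir = Files.createTempDirectory("dtd-catalog");
        Files.writeString(dir.resolve("xhtml1-stub.dtd"),
                "<!ENTITY nbsp \"&#160;\">\n<!ENTITY uuml \"&#252;\">\n");
        Path catalog = dir.resolve("catalog.xml");
        Files.writeString(catalog,
                "<?xml version=\"1.0\"?>\n"
              + "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">\n"
              + "  <system systemId=\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\"\n"
              + "          uri=\"xhtml1-stub.dtd\"/>\n"
              + "</catalog>\n");
        return catalog;
    }

    static Document parse(String xhtml, Path catalog) throws Exception {
        // CatalogResolver implements EntityResolver, so the builder accepts it directly.
        CatalogResolver resolver =
                CatalogManager.catalogResolver(CatalogFeatures.defaults(), catalog.toUri());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
              + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
              + "<html><body><p>M&uuml;ller&nbsp;AG</p></body></html>";
        Document doc = parse(xhtml, writeDemoCatalog());
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

With the default catalog features the resolver is strict, so a request for an identifier that the catalog does not list fails with an exception rather than silently going to the network.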

Answer 1: 7 votes, accepted, 2011-10-31 17:03:27

Answer 2: 2 votes, 2013-02-07 14:17:55
