Xerces DOM parser very slow?

Question

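I am currently trying to clean up HTML files with JTidy, convert them to XHTML, and feed the result to a DOM parser. The following code is the result of those efforts: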

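import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Collection;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;
import org.xml.sax.SAXException;

public class HeaderBasedNewsProvider implements INewsProvider {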

    /* ... */

    public Collection<INewsEntry> getNewsEntries() throws NewsUnavailableException {
        Document document;
        try {
            document = getCleanedDocument();
        } catch (Exception e) {
            throw new NewsUnavailableException(e);
        }
        System.err.println(document.getDocumentElement().getTextContent());
        return null;
    }

    private final Document getCleanedDocument() throws IOException, SAXException, ParserConfigurationException {
        InputStream input = inputStreamProvider.getInputStream();
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        ByteArrayOutputStream tidyOutputStream = new ByteArrayOutputStream();
        tidy.parse(input, tidyOutputStream);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        InputStream domInputStream = new ByteArrayInputStream(tidyOutputStream.toByteArray());
        System.err.println(factory.getClass());
        return factory.newDocumentBuilder().parse(domInputStream);
    }
}

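However, the DOM parser implementation on my system (com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl) seems to be extremely slow. Even for a one-line document like the one below, parsing takes 2-3 minutes: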

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title></title></head><body><div class="text"><h2>Nachricht vom 16. Juni 2011</h2><h1>Titel</h1><p>Mitteilung <a href="dokumente/medienmitteilungen/MM_NR_jglp.pdf" target="_blank">weiter</a> mehr Mitteilung</p></div></body></html>

Note that, in contrast to the DOM parser, JTidy finishes its work in under a second. I therefore suspect that I am misusing the DOM API in some way.

Thanks in advance for any suggestions!

Answer 1

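Even when it is not validating, an XML parser still needs to fetch the DTD, for example to support named character entities. You should consider implementing an EntityResolver that resolves requests for the DTD to a local copy.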
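
The suggestion above could be sketched roughly as follows. This is a minimal illustration, not the answerer's code: the class name `LocalDtdResolver`, the in-memory map keyed by public ID, and the `parse` helper are all assumptions made for the example. A real setup would load the bytes from local copies of the DTD files.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical resolver that serves DTDs from local copies, keyed by public ID. */
public class LocalDtdResolver implements EntityResolver {

    // Assumption: the caller has already loaded the local DTD copies into memory.
    private final Map<String, byte[]> dtdsByPublicId;

    public LocalDtdResolver(Map<String, byte[]> dtdsByPublicId) {
        this.dtdsByPublicId = dtdsByPublicId;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        byte[] dtd = (publicId == null) ? null : dtdsByPublicId.get(publicId);
        if (dtd == null) {
            // Returning null lets the parser fall back to its default behaviour,
            // which may mean opening the system ID as a URL.
            return null;
        }
        InputSource source = new InputSource(new ByteArrayInputStream(dtd));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    /** Parses the XML text with the given resolver installed on the builder. */
    public static Document parse(String xml, EntityResolver resolver) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

In the question's code, the equivalent change would be to call `setEntityResolver` on the `DocumentBuilder` before `parse`. Note that the XHTML DTDs themselves reference further entity files, so each of those would need its own entry in the map.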

Answer 2

The HTML DTDs are huge and use includes. They take forever to fetch. Use XML catalogs: you can store the DTDs locally and map them by system ID.

If you use a tool like Maven, you will find plenty of pointers.

The advantage over intercepting entities, as the accepted answer suggests, is that you receive the correct characters.
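
As a rough sketch of the catalog approach, the following uses the `javax.xml.catalog` API that ships with Java 9 and later. That API did not exist when this answer was written, so it is a substitute for whatever catalog library the answerer had in mind. The class name `CatalogParsing`, the helper `parseWithCatalog`, and the file names in the comment are assumptions for illustration.

```java
import java.io.StringReader;
import java.net.URI;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/**
 * Hypothetical helper that parses XML using an OASIS XML catalog.
 *
 * Assumed catalog file (catalog.xml), stored next to a local copy of the DTD:
 *
 *   <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
 *     <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
 *             uri="xhtml1-transitional.dtd"/>
 *   </catalog>
 */
public class CatalogParsing {

    public static Document parseWithCatalog(String xml, URI catalogUri) throws Exception {
        // CatalogResolver implements org.xml.sax.EntityResolver, so it can be
        // installed on the builder directly.
        CatalogResolver resolver =
                CatalogManager.catalogResolver(CatalogFeatures.defaults(), catalogUri);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

The relative `uri` in the catalog entry is resolved against the location of the catalog file, so the DTD copies can live in the same directory. With the default catalog features, a system ID that has no catalog entry is reported as an error rather than silently fetched.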

Answer 1: 7 (accepted), 2011-10-31 17:03:27

Answer 2: 2, 2013-02-07 14:17:55
