Prevent Jsoup.parse from removing the closing </img> tag

Question

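I am using Jsoup.parse to parse a piece of HTML.

Everything else works well, but I later need to feed this HTML to a PDF converter.

For some reason, Jsoup.parse removes the closing tag, and the PDF parser then throws an exception about a missing closing img tag.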

Can't load the XML resource (using TRaX transformer). org.xml.sax.SAXParseException; 
lineNumber: 115; columnNumber: 4; The element
type "img" must be terminated by the matching end-tag "</img>"

How can I prevent Jsoup.parse from removing the closing img tag?

For example, this line:

<img src="C:\path\to\image\image.png"></img>

becomes:

<img src="C:\path\to\image\image.png">

The same happens with:

<img src="C:\path\to\image\image.png"/>

Here is the code:

private void createPdf(File file, String content) throws IOException, DocumentException {
    OutputStream os = new FileOutputStream(file);
    content = tidyUpHTML(content);
    ITextRenderer renderer = new ITextRenderer();
    renderer.setDocumentFromString(content);
    renderer.layout();
    renderer.createPDF(os);
    os.close();
}

And this is the tidyUpHTML method called in the method above:

private String tidyUpHTML(String html) {
    org.jsoup.nodes.Document doc = Jsoup.parse(html);
    doc.select("a").unwrap();
    String fixedTags = doc.toString().replace("<br>", "<br />");
    fixedTags = fixedTags.replace("<hr>", "<hr />");
    fixedTags = fixedTags.replaceAll("&nbsp;","&#160;");
    return fixedTags;
}

Answer 1

Your PDF converter expects XHTML (which is why it requires the img tag to be closed). Configure Jsoup to output XHTML (XML syntax).

org.jsoup.nodes.Document doc = Jsoup.parse(html);
// Serialize with XML syntax so that void elements such as img are closed
doc.outputSettings().syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml);
doc.select("a").unwrap();
String fixedTags = doc.html();
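
To illustrate why the converter rejects Jsoup's default HTML output, here is a minimal sketch that uses only the JDK's XML parser. The class name WellFormedCheck and the helper isWellFormedXml are hypothetical names introduced for this illustration, and the markup strings are simplified stand-ins for the real document. An unterminated img is valid HTML but not well-formed XML, and that is what the SAXParseException in the question is reporting.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {

    // Hypothetical helper: returns true if the markup parses as well-formed XML.
    public static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // the markup is not well-formed XML
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML-style void element: rejected by an XML parser
        System.out.println(isWellFormedXml("<p><img src=\"image.png\"></p>"));       // false
        // Self-closed and explicitly closed forms: both accepted
        System.out.println(isWellFormedXml("<p><img src=\"image.png\" /></p>"));     // true
        System.out.println(isWellFormedXml("<p><img src=\"image.png\"></img></p>")); // true
    }
}
```

A check like this can be run on the string returned by tidyUpHTML before it is handed to the renderer, which makes it easier to tell whether a failure comes from the markup or from the PDF step.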

See also: Is it possible to convert HTML into XHTML with Jsoup 1.8.1?
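
Putting this together, the tidyUpHTML method from the question could be reduced to something like the sketch below. It assumes jsoup is on the classpath; the class name XhtmlTidy and the sample input are made up for the example. With XML syntax, jsoup should write br, hr and img in self-closed form, and the xhtml escape mode should emit a numeric character reference instead of &nbsp;, so the manual replace calls are likely unnecessary. Treat it as a starting point and check the output against your jsoup version.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;

public class XhtmlTidy {

    // Sketch of tidyUpHTML that relies on jsoup's output settings
    // instead of string replacement.
    public static String tidyUpHTML(String html) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings()
           // XML syntax: void elements are serialized in self-closed form
           .syntax(Document.OutputSettings.Syntax.xml)
           // XHTML escape mode: avoids named entities such as &nbsp; that plain XML does not define
           .escapeMode(Entities.EscapeMode.xhtml);
        doc.select("a").unwrap();
        return doc.html();
    }

    public static void main(String[] args) {
        // Sample input (assumption): a fragment with void elements and a non-breaking space
        System.out.println(tidyUpHTML("<p>one<br>two&nbsp;<img src=\"image.png\"></p>"));
    }
}
```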

Answer 1: 7, accepted, 2016-12-08 13:38:01
