Convert malformed HTML to PDF using Flying Saucer PDF Rendering
In a GitHub project I'm trying to convert an arbitrary HTML string into a PDF. By "convert" I mean parse the HTML and render it into a PDF file.
To achieve that I'm using Flying Saucer PDF Rendering like this:
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Assumption: DocumentException comes from the iText version bundled with
// Flying Saucer; adjust the package if your build uses a different iText.
import com.lowagie.text.DocumentException;
import org.xhtmlrenderer.pdf.ITextRenderer;

public class Main {
    public static void main(String[] args) {
        final String ok = "<valid html here>: see github rep for real html markup here";
        final String html = "<invalid html here>: see github rep for real html markup here";
        try {
            // final byte[] bytes = generatePDFFrom(ok); // works!
            final byte[] bytes = generatePDFFrom(html); // does NOT work :(
            try (FileOutputStream fos = new FileOutputStream("sample-file.pdf")) {
                fos.write(bytes);
            }
        } catch (IOException | DocumentException e) {
            e.printStackTrace();
        }
    }

    // Parses the markup, lays it out and renders the PDF into memory.
    private static byte[] generatePDFFrom(String html) throws IOException, DocumentException {
        final ITextRenderer renderer = new ITextRenderer();
        renderer.setDocumentFromString(html);
        renderer.layout();
        try (ByteArrayOutputStream fos = new ByteArrayOutputStream(html.length())) {
            renderer.createPDF(fos);
            return fos.toByteArray();
        }
    }
}
In the above code, if I use the HTML string stored in the ok variable (this is "valid" HTML), the PDF is created correctly. If you run the GitHub project with the ok variable, it creates a file sample-file.pdf inside the project folder with some rendered HTML.
Now, if I use the value in the html variable (HTML with invalid tags, tags that may not be closed properly, etc.), it throws the following error (the error can vary depending on the incorrect value):
ERROR: 'The markup in the document following the root element must be well-formed.'
Exception in thread "main" org.xhtmlrenderer.util.XRRuntimeException: Can't load the XML resource (using TrAX transformer). org.xml.sax.SAXParseException; lineNumber: 22; columnNumber: 9; The markup in the document following the root element must be well-formed.
at org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.transform(XMLResource.java:222)
at org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.createXMLResource(XMLResource.java:181)
at org.xhtmlrenderer.resource.XMLResource.load(XMLResource.java:84)
at org.xhtmlrenderer.pdf.ITextRenderer.setDocumentFromString(ITextRenderer.java:171)
at org.xhtmlrenderer.pdf.ITextRenderer.setDocumentFromString(ITextRenderer.java:166)
at Main.generatePDFFrom(Main.java:84)
at Main.main(Main.java:72)
Caused by: javax.xml.transform.TransformerException: org.xml.sax.SAXParseException; lineNumber: 22; columnNumber: 9; The markup in the document following the root element must be well-formed.
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:740)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:343)
at org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.transform(XMLResource.java:220)
... 6 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 22; columnNumber: 9; The markup in the document following the root element must be well-formed.
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(TransformerImpl.java:659)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:728)
... 8 more
Now, as far as I understand, this is because of the "invalid" parts of the HTML string.
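The stack trace shows the failure coming from an XML parser (org.xml.sax.SAXParseException), so one way to see whether a given string will be rejected is to run it through the JDK's own XML parser before handing it to the renderer. The following is only a minimal sketch under that assumption: the class name WellFormedCheck and the method firstParseError are made up for illustration and are not part of Flying Saucer, and the sample strings are stand-ins for the real markup.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of Flying Saucer.
class WellFormedCheck {

    /**
     * Returns null if the markup is well-formed XML, otherwise a short
     * description of the first parse error.
     */
    static String firstParseError(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse problems instead of letting the default handler
            // print them to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        // Well-formed: every tag is closed and there is a single root element.
        String wellFormed = "<html><body><p>Hello</p></body></html>";
        // Malformed for an XML parser: <br> is never closed.
        String unclosedTag = "<html><body><p>Hello<br></p></body></html>";
        // Malformed: content after the root element, as in the error above.
        String afterRoot = "<html><body/></html><p>extra</p>";

        System.out.println(firstParseError(wellFormed));
        System.out.println(firstParseError(unclosedTag));
        System.out.println(firstParseError(afterRoot));
    }
}
```

A null result means the string passed the XML well-formedness check; any other result means it needs cleaning before rendering. The samples deliberately omit a DOCTYPE, since a DOCTYPE that points at an external DTD may make the parser try to fetch it.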
Important notes:
ok and html here are just placeholders for the question. Real ones are here.

Since I had the same issue while using Flying Saucer to generate a PDF from HTML, I used the HtmlCleaner library (see maven link) to clean the HTML before passing it to the Flying Saucer library.
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XmlSerializer;
import org.xhtmlrenderer.pdf.ITextRenderer;

// Clean the HTML so that Flying Saucer can parse it.
// Parse the (possibly malformed) markup into a tag tree.
HtmlCleaner cleaner = new HtmlCleaner();
TagNode rootTagNode = cleaner.clean(html);
// Set up properties for the serializer (optional, see online docs).
CleanerProperties cleanerProperties = cleaner.getProperties();
// Serialize the tree back to a well-formed XML string.
XmlSerializer xmlSerializer = new PrettyXmlSerializer(cleanerProperties);
String cleanedHtml = xmlSerializer.getAsString(rootTagNode);
// Use https://github.com/flyingsaucerproject/flyingsaucer to convert the cleaned HTML to PDF.
ITextRenderer renderer = new ITextRenderer();
renderer.setDocumentFromString(cleanedHtml);
// ....

An initial thought would be to parse your input with another library that handles HTML more leniently, and then pass that library's toString() output to the PDF renderer.
https://jsoup.org/
Five minutes of Googling turned this up as a pretty reasonable library to use. There's even a test utility you can try throwing your malformed input into: