
Convert malformed HTML to PDF using Flying Saucer PDF Rendering

In a GitHub project, I'm trying to convert an arbitrary HTML string into a PDF. By "convert" I mean parsing the HTML and rendering it into a PDF file.

To achieve that, I'm using Flying Saucer PDF Rendering like this:

Main.java

public class Main {

    public static void main(String [] args) {
        final String ok = "<valid html here>: see github rep for real html markup here";
        final String html = "<invalid html here>: see github rep for real html markup here";
        try {
            // final byte[] bytes = generatePDFFrom(ok); // works!
            final byte[] bytes = generatePDFFrom(html); // does NOT work :(
            try(FileOutputStream fos = new FileOutputStream("sample-file.pdf")) {
                fos.write(bytes);
            }

        } catch (IOException | DocumentException e) {
            e.printStackTrace();
        }
    }

    private static byte[] generatePDFFrom(String html) throws IOException, DocumentException {
        final ITextRenderer renderer = new ITextRenderer();
        renderer.setDocumentFromString(html);
        renderer.layout();
        try (ByteArrayOutputStream fos = new ByteArrayOutputStream(html.length())) {
            renderer.createPDF(fos);
            return fos.toByteArray();
        }
    }
}

In the code above, if I use the HTML string stored in the ok variable (a "valid" HTML document), the PDF is created correctly: running the GitHub project with the ok variable creates a file sample-file.pdf in the project folder, containing the rendered HTML.

However, if I use the value in the html variable (HTML with invalid tags, tags not closed properly, etc.), it throws the following error (the exact error varies depending on what is wrong with the input):

ERROR:  'The markup in the document following the root element must be well-formed.'
Exception in thread "main" org.xhtmlrenderer.util.XRRuntimeException: Can't load the XML resource (using TrAX transformer). org.xml.sax.SAXParseException; lineNumber: 22; columnNumber: 9; The markup in the document following the root element must be well-formed.
    at org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.transform(XMLResource.java:222)
    at org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.createXMLResource(XMLResource.java:181)
    at org.xhtmlrenderer.resource.XMLResource.load(XMLResource.java:84)
    at org.xhtmlrenderer.pdf.ITextRenderer.setDocumentFromString(ITextRenderer.java:171)
    at org.xhtmlrenderer.pdf.ITextRenderer.setDocumentFromString(ITextRenderer.java:166)
    at Main.generatePDFFrom(Main.java:84)
    at Main.main(Main.java:72)
Caused by: javax.xml.transform.TransformerException: org.xml.sax.SAXParseException; lineNumber: 22; columnNumber: 9; The markup in the document following the root element must be well-formed.
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:740)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:343)
    at org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.transform(XMLResource.java:220)
    ... 6 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 22; columnNumber: 9; The markup in the document following the root element must be well-formed.
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(TransformerImpl.java:659)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:728)
    ... 8 more

As far as I understand, this happens because of the "invalid" parts of the HTML string.
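To confirm that diagnosis without involving Flying Saucer at all, the failure can be reproduced with the JDK's own XML parser, since the stack trace shows a SAXParseException coming from the XML parsing step. The following is only a minimal sketch: the class WellFormedCheck and the method isWellFormed are hypothetical names, not part of Flying Saucer, and the sample strings are made up for illustration (they have no DOCTYPE, so no external DTD is fetched).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Flying Saucer): reports whether a string
// is well-formed XML according to the JDK's built-in parser.
final class WellFormedCheck {

    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // SAXParseException (a SAXException) signals markup that is not well-formed.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("XML parser unavailable or input unreadable", e);
        }
    }

    public static void main(String[] args) {
        // Properly nested and closed markup.
        System.out.println(isWellFormed("<html><body><p>ok</p></body></html>"));
        // The <p> element is never closed.
        System.out.println(isWellFormed("<html><body><p>unclosed</body></html>"));
        // Markup after the root element, similar to the error in the trace above.
        System.out.println(isWellFormed("<html></html><p>after root</p>"));
    }
}
```

A check like this only tells me whether the input would be rejected; it does not repair anything.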

Important notes:

  • The values assigned to the variables ok and html here are just placeholders for the question. The real ones are in the GitHub repository.
  • In the real project, the HTML string is input that comes from the user. Yes, they should know what to put there, but of course they can make mistakes in the HTML, so I have to handle this.

Question(s)

  • Is there any way to tell Flying Saucer PDF Rendering to ignore, auto-complete, or clean up those "invalid" parts itself and move on with creating the PDF file (preferred)?
  • Is there a better approach I can use to overcome this?

I had the same issue while using Flying Saucer to generate a PDF from HTML, so I used the HtmlCleaner library (see the Maven link) to clean the HTML before passing it to the Flying Saucer library.

// Clean the html to use in the flying saucer converting tool
// get the element you want to serialize
HtmlCleaner cleaner = new HtmlCleaner();
TagNode rootTagNode = cleaner.clean(html);
// set up properties for the serializer (optional, see online docs)
CleanerProperties cleanerProperties = cleaner.getProperties();
// use the getAsString method on an XmlSerializer class
XmlSerializer xmlSerializer = new PrettyXmlSerializer(cleanerProperties);
String cleanedHtml = xmlSerializer.getAsString(rootTagNode);

// use the https://github.com/flyingsaucerproject/flyingsaucer to convert cleaned HTML to PDF
ITextRenderer renderer = new ITextRenderer();
renderer.setDocumentFromString(cleanedHtml);
// ....

An initial thought would be to parse your input with another library that handles HTML better, and then pass that library's toString() output to the PDF renderer.

https://jsoup.org/

Five minutes of Googling turned this up as a pretty reasonable library to use. There's even a test utility you can try throwing your malformed input into:

https://try.jsoup.org/
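As a rough sketch of that idea, assuming jsoup is on the classpath: jsoup parses the malformed input into a document tree, and its output settings can be switched to XML syntax so the serialized string has closed tags and XHTML-safe entities. The class JsoupCleanSketch and the method toXhtml are hypothetical names for illustration, and the sample input is made up.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;

// Hypothetical helper: normalizes arbitrary HTML into XML-syntax markup
// before it is handed to an XML-based renderer.
final class JsoupCleanSketch {

    static String toXhtml(String html) {
        // jsoup builds a complete html/head/body tree even from tag soup.
        Document doc = Jsoup.parse(html);
        doc.outputSettings()
                // Serialize as XML: void elements self-close, all tags are closed.
                .syntax(Document.OutputSettings.Syntax.xml)
                // Emit only entities an XML parser understands.
                .escapeMode(Entities.EscapeMode.xhtml)
                .charset("UTF-8");
        return doc.html();
    }

    public static void main(String[] args) {
        // Unclosed <p>, <br> and <div>, as a user might type them.
        System.out.println(toXhtml("<p>Hello<br>world<div>never closed"));
    }
}
```

The returned string could then be passed to renderer.setDocumentFromString(...) in place of the raw input. Whether the cleaned-up structure matches what the user intended is a separate question; jsoup only guarantees a consistent tree, not the original author's intent.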
