Generating a large PDF from a large amount of data

Question

I read data from a database and generate an HTML DOM from it. The data volume is too large to fit in memory at once, but it can be provided chunk by chunk.

I would like to transform the resulting HTML into a PDF using Flying Saucer:

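import java.io.IOException;
import java.io.OutputStream;

import org.apache.commons.io.IOUtils; // assumed: IOUtils comes from Apache Commons IO
import org.dom4j.Document;
import org.dom4j.DocumentFactory;
import org.dom4j.Element;
import org.dom4j.io.DOMWriter;
import org.xhtmlrenderer.pdf.ITextRenderer;
// The unqualified DocumentException caught below is iText's; its package
// depends on the iText version bundled with Flying Saucer.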

OutputStream bodyStream = outputMessage.getBody();

ITextRenderer renderer = new ITextRenderer();

DocumentFactory documentFactory = DocumentFactory.getInstance();
DOMWriter domWriter = new DOMWriter();

Element htmlNode = documentFactory.createElement("html");
Document htmlDocument = documentFactory.createDocument(htmlNode);

int currentLine = 1;
int currentPage = 1;

try {
    while (currentLine <= numberOfLines) {
        currentLine += loadDataToDOM(documentFactory, htmlNode, currentLine, CHUNK_SIZE);

        renderer.setDocument(domWriter.write(htmlDocument), null);
        renderer.layout();

        if (currentPage == 1) {
            // For the first page the PDF writer is created:
            renderer.createPDF(bodyStream, false);
        }
        else {
            // Other documents are appended to current PDF writer:
            renderer.writeNextDocument(currentPage);
        }

        currentPage += renderer.getRootBox().getLayer().getPages().size();
    }

    // Finalise the PDF:
    renderer.finishPDF();
}
catch (DocumentException e) {
    throw new IOException(e);
}
catch (org.dom4j.DocumentException e) {
    throw new IOException(e);
}
finally {
    IOUtils.closeQuietly(bodyStream);
}

The problem with this approach is that the last page of each chunk is not necessarily completely filled with data. Is there a way to fill that space? For example, I could imagine an approach that checks whether the last page is completely filled and, if it is not, discards it (does not write it to the PDF), then works out which data was rendered on that page and rewinds the position in the database (currentLine in the example). It would be nice if someone could post a complete solution.

Answer 1

As I already mentioned in the comments, you are wasting memory and processing time by first creating HTML from the data source and then converting that HTML to PDF. You're also introducing plenty of unnecessary complexity.

In your comment, you mention low-level functionality such as moveTo() and lineTo(). It would indeed be madness to draw a table using low-level operations that draw every single line and every single word.

You should use the PdfPTable class. The ArrayToTable example is a very simple POC where the data comes in the form of a List<List<String>>. The code is as simple as this:

PdfPTable table = new PdfPTable(8);
table.setWidthPercentage(100);
List<List<String>> dataset = getData();
for (List<String> record : dataset) {
    for (String field : record) {
        table.addCell(field);
    }
}
document.add(table);

Of course, you are talking about a huge data set, in which case you may not want to build up the whole table in memory first and only release that memory when the table is added to the document. You'll want to add small parts of the table while you are building it. That's what happens in the MemoryTests example. Add this line:

table.setComplete(false);

You can then add the table little by little (in the example: every 10 rows). When you've finished adding cells to the table, you should do this:

table.setComplete(true);
document.add(table);

This will add the final rows.
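Putting the pieces above together, the incremental approach might look like the following minimal sketch. It assumes iText 5 (the com.itextpdf.text packages) is on the classpath; fetchChunk(), TOTAL_ROWS, CHUNK_SIZE and the output file name are hypothetical stand-ins for the real database access and are not part of the original answer.

```java
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfPTable;
import com.itextpdf.text.pdf.PdfWriter;

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LargeTableSketch {
    static final int COLUMNS = 8;
    static final int TOTAL_ROWS = 1000;  // hypothetical size of the data set
    static final int CHUNK_SIZE = 100;   // hypothetical number of rows fetched per query
    static final int FLUSH_EVERY = 10;   // rows between two document.add(table) calls

    // Hypothetical stand-in for a database query returning rows [offset, offset + size).
    static List<List<String>> fetchChunk(int offset, int size) {
        List<List<String>> rows = new ArrayList<>();
        for (int r = offset; r < Math.min(offset + size, TOTAL_ROWS); r++) {
            List<String> record = new ArrayList<>();
            for (int c = 0; c < COLUMNS; c++) {
                record.add("r" + r + "c" + c);
            }
            rows.add(record);
        }
        return rows;
    }

    public static void main(String[] args) throws IOException, DocumentException {
        Document document = new Document();
        PdfWriter.getInstance(document, new FileOutputStream("large_table.pdf"));
        document.open();

        PdfPTable table = new PdfPTable(COLUMNS);
        table.setWidthPercentage(100);
        table.setComplete(false);  // tell iText that more rows will follow

        int rowCount = 0;
        for (int offset = 0; offset < TOTAL_ROWS; offset += CHUNK_SIZE) {
            for (List<String> record : fetchChunk(offset, CHUNK_SIZE)) {
                for (String field : record) {
                    table.addCell(field);
                }
                if (++rowCount % FLUSH_EVERY == 0) {
                    document.add(table);  // writes the rows added so far
                }
            }
        }

        table.setComplete(true);
        document.add(table);  // adds the final rows
        document.close();
    }
}
```

Only one chunk of source rows is held in memory at a time, and the table is handed to the document every FLUSH_EVERY rows instead of once at the end.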

If you want a table with a repeating header and/or footer, take a look at the tables in this PDF: header_footer_1.pdf

The HeaderFooter1 and HeaderFooter2 examples will show you how it's done.

Answer 2

This is not an answer to the precise question you asked, so if this post is useless I'll delete it.

Since the document is huge, you may well get the best results by emitting the data as LaTeX and then running it through pdflatex.

Advantages:

LaTeX source of the kind you need is simple to emit: no more complicated than HTML.
The whole TeX system is designed to produce beautiful and huge documents. LaTeX is processed as a stream of pages. The number of pages has essentially no effect on the RAM required.
You get the full power of a typesetting language to make your pages look great. Want fancy headers? Nicely positioned page numbers? Section headings? A clickable table of contents? No problem.
LaTeX is available free for all major operating systems.

Disadvantages:

LaTeX is a native executable, not a Java library.
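To illustrate how simple the LaTeX source can be to emit, here is a minimal sketch that streams rows into a longtable (a LaTeX package whose tables can break across pages) one chunk at a time, using only the Java standard library. The class name, fetchChunk(), TOTAL_ROWS, CHUNK_SIZE and the file name report.tex are hypothetical and only serve the illustration; the real data would come from the database.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class LatexTableEmitter {
    static final int TOTAL_ROWS = 1000;  // hypothetical size of the data set
    static final int CHUNK_SIZE = 100;   // hypothetical number of rows fetched per query

    /** Escapes the characters that LaTeX treats specially in running text. */
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\textbackslash{}"); break;
                case '~':  sb.append("\\textasciitilde{}"); break;
                case '^':  sb.append("\\textasciicircum{}"); break;
                case '&': case '%': case '$': case '#': case '_': case '{': case '}':
                    sb.append('\\').append(c); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Writes the preamble and opens a longtable with left-aligned columns. */
    public static void writeHeader(Writer out, int columns) throws IOException {
        out.write("\\documentclass{article}\n\\usepackage{longtable}\n\\begin{document}\n");
        StringBuilder spec = new StringBuilder();
        for (int i = 0; i < columns; i++) {
            spec.append('l');
        }
        out.write("\\begin{longtable}{" + spec + "}\n");
    }

    /** Writes one chunk of records as table rows; nothing is retained afterwards. */
    public static void writeRows(Writer out, List<List<String>> rows) throws IOException {
        for (List<String> record : rows) {
            for (int i = 0; i < record.size(); i++) {
                if (i > 0) {
                    out.write(" & ");
                }
                out.write(escape(record.get(i)));
            }
            out.write(" \\\\\n");
        }
    }

    /** Closes the table and the document. */
    public static void writeFooter(Writer out) throws IOException {
        out.write("\\end{longtable}\n\\end{document}\n");
    }

    // Hypothetical stand-in for a database query returning rows [offset, offset + size).
    static List<List<String>> fetchChunk(int offset, int size) {
        List<List<String>> rows = new ArrayList<>();
        for (int r = offset; r < Math.min(offset + size, TOTAL_ROWS); r++) {
            List<String> record = new ArrayList<>();
            record.add("row " + r);
            record.add("value_" + r);
            rows.add(record);
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        try (BufferedWriter out =
                 Files.newBufferedWriter(Paths.get("report.tex"), StandardCharsets.UTF_8)) {
            writeHeader(out, 2);
            for (int offset = 0; offset < TOTAL_ROWS; offset += CHUNK_SIZE) {
                writeRows(out, fetchChunk(offset, CHUNK_SIZE));
            }
            writeFooter(out);
        }
    }
}
```

The resulting report.tex would then be passed to pdflatex as a separate step, for example by launching it as an external process.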

If you are interested in this, I can flesh out more details.

2 answers

Answer 1
6 2014-06-26 14:01:50

Answer 2
4 2014-07-03 22:45:40
