Problem with PDFTextStripper().getText when using PDType0Font in pdfbox

I recently started working with PDType0Font (we had been using PDType1Font.HELVETICA but needed Unicode support), and I'm running into a problem: I add lines to the file using PDPageContentStream, but PDFTextStripper.getText doesn't return the updated file contents.

I'm loading the font:

PDType0Font.load(document, fontFile)

And creating the contentStream as follows:

PDPageContentStream(document, pdPage, PDPageContentStream.AppendMode.PREPEND, false)

My function that adds content to the PDF is:

  private fun addTextToContents(contentStream: PDPageContentStream, txtLines: List<String>, x: Float, y: Float, pdfFont: PDFont, fontSize: Float, maxWidth: Float) {
     contentStream.beginText()
     contentStream.setFont(pdfFont, fontSize)
     contentStream.newLineAtOffset(x, y)
     txtLines.forEach { txt ->
       contentStream.showText(txt)
       // move down one line, using the font size as the line height
       contentStream.newLineAtOffset(0.0F, -fontSize)
     }
     contentStream.endText()
     contentStream.close()
  }

When I try to read the content of the file using PDFTextStripper.getText, I get the file as it was before the changes. However, if I call document.save before passing the document to PDFTextStripper, it works.

      val txt: String = PDFTextStripper().getText(doc) // not working

      doc.save(/* File */)
      val txt: String = PDFTextStripper().getText(doc) // working

If I use PDType1Font.HELVETICA in

contentStream.setFont(pdfFont, fontSize)

everything works without any problems, and without saving the doc before reading the text.

I suspect that the issue is with this code in PDPageContentStream.showTextInternal():

    // Unicode code points to keep when subsetting
    if (font.willBeSubset())
    {
        int offset = 0;
        while (offset < text.length())
        {
            int codePoint = text.codePointAt(offset);
            font.addToSubset(codePoint);
            offset += Character.charCount(codePoint);
        }
    }

This is the only difference I can see between using PDType0Font with subset embedding and using PDType1Font.

Can someone help with this? What am I doing wrong?

Your question, in particular the quoted code, already hints at the answer:

When using a font that will be subset ( font.willBeSubset() == true ), the associated PDF objects are unfinished until the file is saved. Text extraction, on the other hand, needs the finished PDF objects to work properly. Thus, don't apply text extraction to a document that is still being created and uses fonts that will be subset.
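
To illustrate what "unfinished" means here, the following is a minimal standalone sketch of the loop quoted in the question. It does not use PDFBox at all: the class SubsetTracking and the method collectCodePoints are hypothetical names, and the Set is only a stand-in for whatever font.addToSubset records internally. The point is that showing text merely notes which code points were used; nothing in this step produces the final font data that text extraction relies on.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class SubsetTracking {

    // Hypothetical stand-in for the bookkeeping done by font.addToSubset(codePoint):
    // it only remembers which code points were used, nothing more.
    static Set<Integer> collectCodePoints(String text) {
        Set<Integer> used = new LinkedHashSet<>();
        int offset = 0;
        while (offset < text.length()) {
            // codePointAt handles surrogate pairs, so a character outside
            // the Basic Multilingual Plane is recorded once, not as two chars.
            int codePoint = text.codePointAt(offset);
            used.add(codePoint);
            offset += Character.charCount(codePoint);
        }
        return used;
    }

    public static void main(String[] args) {
        // 'a', U+00F1, U+1F600 (written as a surrogate pair), 'a' again
        System.out.println(collectCodePoints("a\u00F1\uD83D\uDE00a"));
        // prints [97, 241, 128512]
    }
}
```

The repeated 'a' is stored once and the surrogate pair counts as a single code point, which is why the loop advances by Character.charCount rather than by one char.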

You describe your use case as

for our unit tests, we are adding text (mandatory text for us) to the document and then using PDFTextStripper we are validating that the file has the proper fields.

As Tilman proposes: Then it would make more sense to save the PDF, and then to reload. That would be a more realistic test. Not saving is cutting corners IMHO.

Indeed, in unit tests you should first produce the final PDF as it will be sent out (i.e. save it, either to the file system or to memory), then reload that file, and test only this reloaded document.
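
A minimal sketch of that save-then-reload step, saving to memory rather than to disk. It assumes PDFBox 2.x, where PDDocument.load(byte[]) is available (in 3.x, loading goes through Loader.loadPDF instead). The class name SaveThenExtract and the method extractAfterSave are hypothetical helpers made up for this illustration, not part of PDFBox.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class SaveThenExtract {

    // Hypothetical test helper: saves the in-progress document to memory,
    // reloads it, and extracts text from the reloaded copy only.
    static String extractAfterSave(PDDocument document) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Saving is the step after which the subset font objects are finished.
        document.save(buffer);

        // Assumes PDFBox 2.x; the reloaded document is closed automatically.
        try (PDDocument reloaded = PDDocument.load(buffer.toByteArray())) {
            return new PDFTextStripper().getText(reloaded);
        }
    }
}
```

In a unit test, you would call this helper after all content streams have been closed and run the assertions against the returned string. Because the class is plain Java, it can be called from Kotlin test code as SaveThenExtract.extractAfterSave(doc).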
