How to get the metadata of a cropped PDF using PDFBox

Question

I am using the following code to get the metadata (text and position) of each character:

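// Note: the PDFTextStripper constructor throws IOException.
PDFTextStripper stripper = new PDFTextStripper()
{
    @Override
    protected void processTextPosition(TextPosition text)
    {
        // Print each character with its rotation-adjusted position.
        System.out.println(text.getUnicode() + " : " + text.getX() + " : " + text.getY());
    }
};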

The above code works fine. Now I have cropped part of the PDF and saved it. If I apply the same logic to this cropped PDF, it returns the metadata of all the characters in the parent PDF.

Please suggest how to get the metadata of only those characters that are in the cropped PDF.

Thanks in advance.

Answer 1

A cropped PDF page essentially is a PDF page for which a crop box is defined, i.e. a rectangle on the canvas, and PDF viewers know they are expected to display only the content inside that box. The content outside the box is still present in the page, which is why the text stripper still reports it.

If you want to respect that crop box during text extraction, you simply have to filter by coordinate. For simple text extraction you can do so by using PDFTextStripperByArea: register the region with addRegion and read the result with its getTextForRegion method.

As you are not simply taking the string returned by the text stripper but instead inject your code by overriding a method which is called before the filtering done by that class, you'll have to filter yourself.

Be aware that you need to filter according to the PDF page coordinate system, not the coordinates that PDFTextStripper adjusts for page rotation so that the upper left is 0,0. This means that for a TextPosition text you have to use

text.getTextMatrix().getTranslateX(), text.getTextMatrix().getTranslateY()

instead of text.getX(), text.getY().
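The coordinate filter described above can be sketched in plain Java. This is only an illustration under stated assumptions, not PDFBox code: CropBoxFilter and Glyph are hypothetical names introduced here. In PDFBox 2.x the four corner values would presumably come from the PDRectangle returned by the page's getCropBox() (getLowerLeftX(), getLowerLeftY(), getUpperRightX(), getUpperRightY()), and each glyph's x and y from text.getTextMatrix().getTranslateX() and getTranslateY().

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch: keep only glyphs whose origin lies inside a crop box.
 * All coordinates are in the PDF page coordinate system (origin at lower left).
 * The class and its members are hypothetical, not part of PDFBox.
 */
final class CropBoxFilter {

    /** Hypothetical holder for one character and its text-matrix translation. */
    static final class Glyph {
        final String unicode;
        final float x;
        final float y;

        Glyph(String unicode, float x, float y) {
            this.unicode = unicode;
            this.x = x;
            this.y = y;
        }
    }

    // Crop box corners: lower left (llx, lly) and upper right (urx, ury).
    private final float llx;
    private final float lly;
    private final float urx;
    private final float ury;

    CropBoxFilter(float llx, float lly, float urx, float ury) {
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
    }

    /** True if the point lies inside the crop box, borders included. */
    boolean contains(float x, float y) {
        return x >= llx && x <= urx && y >= lly && y <= ury;
    }

    /** Returns the glyphs whose origin is inside the crop box, in input order. */
    List<Glyph> filter(List<Glyph> glyphs) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            if (contains(g.x, g.y)) {
                kept.add(g);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        CropBoxFilter box = new CropBoxFilter(100f, 100f, 300f, 200f);
        List<Glyph> all = new ArrayList<>();
        all.add(new Glyph("A", 120f, 150f)); // inside the box
        all.add(new Glyph("B", 50f, 150f));  // left of the box
        all.add(new Glyph("C", 120f, 250f)); // above the box
        for (Glyph g : box.filter(all)) {
            System.out.println(g.unicode + " : " + g.x + " : " + g.y);
        }
    }
}
```

Inside the overridden processTextPosition, the same contains check could be applied to each TextPosition before printing it. Testing only the glyph origin is a simplification: a character that starts just inside the box and extends past its edge is still kept.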

Answer 1 posted: 2016-11-18 07:37:22
