
How to get metadata for cropped PDF using PDFBox

I am using the following code to get the metadata (text and position) of every character:

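// PDFBox 2.x package names; in 1.8.x both classes live in org.apache.pdfbox.util
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

PDFTextStripper stripper = new PDFTextStripper()
{
    // Called once for every glyph the stripper encounters
    @Override
    protected void processTextPosition(TextPosition text)
    {
        System.out.println(text.toString() + " : " + text.getX() + " : " + text.getY());
    }
};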

The above code works fine. Now I have cropped part of the PDF and saved it. If I apply the same logic to this cropped PDF, it returns the metadata of all characters that are in the parent PDF.

How can I get the metadata of only those characters that are in the cropped PDF?

Thanks in advance.

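A cropped PDF page is essentially a PDF page for which a crop box is defined, i.e. a rectangle on the canvas; PDF viewers know they are expected to display only the content inside that box. The content outside the box is still present in the page's content stream, which is why your code still reports it.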

If you want to respect that crop box during text extraction, you simply have to filter by coordinates. For plain text extraction you can do so with PDFTextStripperByArea and its getTextForRegion method.

Because you do not simply take the string returned by the text stripper but instead inject your own code by overriding a method that is called before that class does its filtering, you will have to filter yourself.

Be aware that you need to filter in the PDF page coordinate system, not in the adjusted PDFTextStripper coordinates, which are normalized for page rotation so that the upper left corner is 0,0. For a TextPosition text this means you have to use

text.getTextMatrix().getTranslateX(), text.getTextMatrix().getTranslateY()

instead of text.getX(), text.getY() .
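
The coordinate filter itself can be sketched without any PDFBox dependency. In the sketch below, Box and Glyph are hypothetical stand-ins, not PDFBox classes: Box plays the role of the page's crop box, and Glyph holds a character together with the text matrix translation values mentioned above. In real code you would presumably read the box from the page (for example via PDPage.getCropBox()) and do the check inside your processTextPosition override; that wiring is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

public class CropBoxFilter {

    /** Hypothetical stand-in for a crop box, in PDF page coordinates. */
    public static final class Box {
        public final float lowerLeftX, lowerLeftY, upperRightX, upperRightY;

        public Box(float lowerLeftX, float lowerLeftY, float upperRightX, float upperRightY) {
            this.lowerLeftX = lowerLeftX;
            this.lowerLeftY = lowerLeftY;
            this.upperRightX = upperRightX;
            this.upperRightY = upperRightY;
        }

        /** True if the point lies inside the box or on its border. */
        public boolean contains(float x, float y) {
            return x >= lowerLeftX && x <= upperRightX
                && y >= lowerLeftY && y <= upperRightY;
        }
    }

    /** Hypothetical stand-in for one character and its text matrix translation. */
    public static final class Glyph {
        public final String unicode;
        public final float x, y;

        public Glyph(String unicode, float x, float y) {
            this.unicode = unicode;
            this.x = x;
            this.y = y;
        }
    }

    /** Keeps only the glyphs whose origin lies inside the crop box. */
    public static List<Glyph> insideCropBox(List<Glyph> glyphs, Box cropBox) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            if (cropBox.contains(g.x, g.y)) {
                kept.add(g);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up values for illustration only
        Box cropBox = new Box(100f, 100f, 300f, 400f);
        List<Glyph> glyphs = List.of(
            new Glyph("A", 150f, 200f),   // inside
            new Glyph("B", 50f, 200f),    // left of the box
            new Glyph("C", 300f, 400f),   // exactly on the upper right corner
            new Glyph("D", 150f, 450f));  // above the box

        for (Glyph g : insideCropBox(glyphs, cropBox)) {
            System.out.println(g.unicode + " : " + g.x + " : " + g.y);
        }
    }
}
```

This sketch tests only the glyph origin, so a character that starts inside the box but extends beyond its edge is still kept; whether that is acceptable depends on how tightly the page was cropped.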
