
Comment or highlight a two-column PDF using PDF Clown

I have searched for a possible solution via Google, Stack Overflow, and the PDF Clown and PDFBox forums, and I am now posting the problem on SO.

Problem: I have been trying to find a way to highlight text that spans multiple lines in a PDF document. The PDF can have one- or two-column pages.

Using PDF Clown, I was able to highlight phrases ONLY if all the words appear on the same line. PDFBox gave me XML for individual words, but I could not find a solution for phrases or lines.

Please suggest a solution for PDF Clown, if there is one, or any other Java-compatible tool that can highlight text spanning multiple lines in a PDF.

I could not understand the answer to a similar question, which uses iText. Any help? See: Multiline markup annotations with iText

Multi-column text extraction is not supported at the moment (PDF Clown 0.1.2): the current algorithm gathers text lying on the same horizontal baseline without evaluating possible gaps between columns.

Automatic detection of multi-column layouts would be possible yet somewhat tricky, as PDF is essentially (you know) an unstructured graphic format. Nonetheless, I'm considering experimenting with it, in order to deal at least with the most common scenarios.

In the meantime, I suggest an effective workaround (it assumes that the columns of your document are placed in predictable areas): for each column, do a separate text extraction, instructing the TextExtractor to look only at the corresponding page area; then put all those partial extraction results together and apply your filter.
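The workaround above can be sketched without any PDF library. This is only an illustration of the idea, not PDF Clown code: the Fragment class, the column rectangles, and the sample coordinates are all hypothetical, and fragments on one line are assumed to share exactly the same y value (with y growing downward).

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: none of these types belong to PDF Clown or PDFBox.
class ColumnExtractionSketch {

    // A piece of text with its bounding box on the page.
    static final class Fragment {
        final String text;
        final Rectangle2D box;

        Fragment(String text, double x, double y, double width, double height) {
            this.text = text;
            this.box = new Rectangle2D.Double(x, y, width, height);
        }
    }

    // Keeps only the fragments lying fully inside the area,
    // ordered top-to-bottom, then left-to-right.
    static List<Fragment> extractArea(List<Fragment> page, Rectangle2D area) {
        List<Fragment> result = new ArrayList<>();
        for (Fragment fragment : page) {
            if (area.contains(fragment.box)) {
                result.add(fragment);
            }
        }
        result.sort(Comparator.<Fragment>comparingDouble(f -> f.box.getY())
                .thenComparingDouble(f -> f.box.getX()));
        return result;
    }

    // Extracts each column separately and joins the partial results in column order.
    static String readColumns(List<Fragment> page, List<Rectangle2D> columns) {
        StringBuilder text = new StringBuilder();
        for (Rectangle2D column : columns) {
            for (Fragment fragment : extractArea(page, column)) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(fragment.text);
            }
        }
        return text.toString();
    }

    // Made-up two-column page: two lines on the left, two on the right.
    static List<Fragment> samplePage() {
        return Arrays.asList(
                new Fragment("delta", 320, 100, 40, 10),
                new Fragment("alpha", 50, 100, 40, 10),
                new Fragment("epsilon", 320, 115, 50, 10),
                new Fragment("beta", 120, 100, 30, 10),
                new Fragment("gamma", 50, 115, 45, 10));
    }

    // Made-up column areas, assumed to be known in advance.
    static List<Rectangle2D> sampleColumns() {
        return Arrays.<Rectangle2D>asList(
                new Rectangle2D.Double(40, 50, 260, 700),
                new Rectangle2D.Double(310, 50, 260, 700));
    }

    public static void main(String[] args) {
        System.out.println(readColumns(samplePage(), sampleColumns()));
    }
}
```

Grouping the same fragments purely by baseline would interleave the two columns ("alpha beta delta ..."), which is the failure described above; restricting each pass to one column area keeps the reading order intact before any phrase filter is applied.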

It is possible to get the coordinates of each piece of text (typically a single character, from which word positions can be assembled) in a PDF document using PDFBox. Here is the code for it:

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

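// Prints the position and size of every text fragment on each page.
// Written against the PDFBox 1.x API (org.apache.pdfbox.util.PDFTextStripper).
public class PrintTextLocations extends PDFTextStripper {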

    public PrintTextLocations() throws IOException {
        super.setSortByPosition(true);
    }

    public static void main(String[] args) throws Exception {

        PDDocument document = null;
        try {
            File input = new File("C:\\path\\to\\PDF.pdf");
            document = PDDocument.load(input);
            if (document.isEncrypted()) {
                try {
                    document.decrypt("");
                } catch (InvalidPasswordException e) {
                    System.err.println("Error: Document is encrypted with a password.");
                    System.exit(1);
                }
            }
            PrintTextLocations printer = new PrintTextLocations();
            // getAllPages() returns an untyped list in this API, hence the cast below.
            List<?> allPages = document.getDocumentCatalog().getAllPages();
            for (int i = 0; i < allPages.size(); i++) {
                PDPage page = (PDPage) allPages.get(i);
                System.out.println("Processing page: " + i);
                PDStream contents = page.getContents();
                if (contents != null) {
                    printer.processStream(page, page.findResources(), contents.getStream());
                }
            }
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }

    // Called by the stripper for each text fragment (usually one glyph) it encounters.
    @Override
    protected void processTextPosition(TextPosition text) {
        System.out.println("String[" + text.getXDirAdj() + ","
                + text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
                + text.getXScale() + " height=" + text.getHeightDir() + " space="
                + text.getWidthOfSpace() + " width="
                + text.getWidthDirAdj() + "]" + text.getCharacter());
    }
}
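Once word boxes are available in reading order, a phrase that wraps onto a second line can be covered by one rectangle per line. The following is a minimal sketch of that grouping step only, assuming the word boxes were already built from the coordinates printed above; the Word class, the tolerance value, and the sample data are hypothetical, and creating the actual highlight annotation from the rectangles is left to the PDF library in use.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types belong to PDFBox or PDF Clown.
class MultiLineHighlightSketch {

    // A word with its bounding box (y grows downward, as with getYDirAdj()).
    static final class Word {
        final String text;
        final Rectangle2D box;

        Word(String text, double x, double y, double width, double height) {
            this.text = text;
            this.box = new Rectangle2D.Double(x, y, width, height);
        }
    }

    // True if the phrase words match the page words starting at the given index.
    static boolean matchesAt(List<Word> words, int start, String[] phrase) {
        for (int i = 0; i < phrase.length; i++) {
            if (!words.get(start + i).text.equals(phrase[i])) {
                return false;
            }
        }
        return true;
    }

    // Finds the first occurrence of the phrase in the words (given in reading order)
    // and returns one rectangle per line it touches; empty if the phrase is absent.
    static List<Rectangle2D> highlightBoxes(List<Word> words, String[] phrase,
                                            double lineTolerance) {
        for (int start = 0; start + phrase.length <= words.size(); start++) {
            if (!matchesAt(words, start, phrase)) {
                continue;
            }
            List<Rectangle2D> boxes = new ArrayList<>();
            for (int i = start; i < start + phrase.length; i++) {
                Rectangle2D box = words.get(i).box;
                int last = boxes.size() - 1;
                boolean sameLine = last >= 0
                        && Math.abs(box.getY() - boxes.get(last).getY()) <= lineTolerance;
                if (sameLine) {
                    // Same line: grow the current rectangle to include this word.
                    boxes.set(last, boxes.get(last).createUnion(box));
                } else {
                    // New line: start a new rectangle.
                    boxes.add(box.getBounds2D());
                }
            }
            return boxes;
        }
        return new ArrayList<>();
    }

    // Made-up words: the phrase "which spans lines" wraps after "which".
    static List<Word> sampleWords() {
        return Arrays.asList(
                new Word("highlight", 50, 100, 60, 10),
                new Word("text", 115, 100, 25, 10),
                new Word("which", 145, 100, 35, 10),
                new Word("spans", 50, 115, 35, 10),
                new Word("lines", 90, 115, 30, 10));
    }

    public static void main(String[] args) {
        String[] phrase = {"which", "spans", "lines"};
        for (Rectangle2D box : highlightBoxes(sampleWords(), phrase, 2.0)) {
            System.out.println(box);
        }
    }
}
```

For a two-column page, the words would first need to be put into reading order column by column; otherwise a phrase wrapping within one column is not contiguous in the list.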
