简体   繁体   中英

How to detect OCR in a scanned Document with pdfbox 2.0.0?

The Problem: I have a large folder with many subfolders with many pdfs in them. Some of them already have OCR on them. Some of them don't. So i wanted to write a Java Program to filter the non OCR PDFs out and copy them to a hot folder.

I tested like 20 Documents and what they all have in common is, that if you open them with editor, you can find the word 'font' and the OCR ones and you cant find it in the non OCR ones. My Question now is: How do i implement this check using PDFbox 2.0.0 ? All the solutions i found dont seem to work only with older versions. And I'm not capable of finding a solution in the documentation. (which is clearly my fault)

Thanks in advance.

Here's how to find out if fonts are on the top level of a page:

    PDDocument doc = PDDocument.load(new File(...));
    PDPage page = doc.getPage(0); // 0 based
    PDResources resources = page.getResources();
    for (COSName fontName : resources.getFontNames())
    {
        System.out.println(fontName.getName());
    }
    doc.close();

Re: mkl suggestion, here's how to extract text:

    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setStartPage(1); // 1 based
    stripper.setEndPage(1);
    String extractedText = stripper.getText(doc);
    System.out.println(extractedText);

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM