How to extract table data from scanned PDF?

I have created a Java project that is fairly successful at parsing searchable PDFs with a particular structure. The tables in them are complex, with merged rows or columns, but in each such PDF the structure of the tables remains the same; only the text inside changes. I was able to overcome all of these challenges, armed with PDFBox, PDF2Dom and Tabula.

However, the problem arose yesterday when I was provided a fresh set of PDFs which were scanned. Being scanned, the entire content was just images and was not searchable. Feeling the need for OCR, I started researching Tesseract. However, I found that only using it would just cough up the entire text content of the PDF without any context whatsoever, and checkboxes would be lost. So I tried to convert the PDF into a searchable one by using the combo of Ghostscript and Tesseract. I converted the scanned PDF into jpg images using Ghostscript, by the following:

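import java.awt.Image;
import java.awt.image.RenderedImage;
import java.io.File;
import java.util.List;
import javax.imageio.ImageIO;
import org.ghost4j.document.PDFDocument;
import org.ghost4j.renderer.SimpleRenderer;

// Load the scanned PDF through Ghost4J.
File pdfFile = new File("D://Tess//inputFile.pdf");
PDFDocument document = new PDFDocument();
document.load(pdfFile);

// Render every page at 300 dpi.
SimpleRenderer renderer = new SimpleRenderer();
renderer.setResolution(300);
List<Image> images = renderer.render(document);

// Write one JPG per page: 0.jpg, 1.jpg, ...
for (int i = 0; i < images.size(); i++) {
    Image img = images.get(i);
    ImageIO.write((RenderedImage) img, "jpg", new File(i + ".jpg"));
}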

After this, I converted the generated images back to PDFs using Tesseract.

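import java.util.ArrayList;
import net.sourceforge.tess4j.ITesseract.RenderedFormat;
import net.sourceforge.tess4j.Tesseract;

Tesseract tessInst = new Tesseract();
tessInst.setDatapath("D://Tess//tessdata");
List<RenderedFormat> list = new ArrayList<RenderedFormat>();
list.add(RenderedFormat.PDF);

// OCR each page image into its own searchable PDF (output0.pdf, output1.pdf, ...).
for (int i = 0; i < images.size(); i++)
    tessInst.createDocuments(i + ".jpg", "D://Tess//output" + i, list);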

The PDFs were generated fine, and are even searchable, but when I select a word, the selection highlight is a little skewed from the actual word. Also, the checkboxes cannot be selected. I tried generating a DOM structure by using PDF2Dom, as I had been doing with other PDFs which were searchable without OCR processing and getting great results:

Document document = parser.createDOM(pdf);

This throws the following exception:

java.io.IOException: java.io.IOException: Multi byte glyph name not supported.
    at org.mabb.fontverter.pdf.PdfFontExtractor.convertType0FontToOpenType(PdfFontExtractor.java:217)
    at org.fit.pdfdom.FontTable$Entry.loadType0TtfDescendantFont(FontTable.java:193)
    at org.fit.pdfdom.FontTable$Entry.getData(FontTable.java:146)
    at org.fit.pdfdom.FontTable$Entry.isEntryValid(FontTable.java:162)
    at org.fit.pdfdom.FontTable.addEntry(FontTable.java:49)
    at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:381)
    at org.fit.pdfdom.PDFBoxTree.updateFontTable(PDFBoxTree.java:358)
    at org.fit.pdfdom.PDFDomTree.updateFontTable(PDFDomTree.java:544)
    at org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:204)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
    at com.pv.pdf.PdfExtractor.extractCheckboxValues(PdfExtractor.java:403)
    at com.pv.pdf.PdfExtractor.getMedicalRecordDetails(PdfExtractor.java:372)
    at com.pv.servlet.OnServletLogin.doPost(OnServletLogin.java:32)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at io.undertow.servlet.handlers.ServletHandler.handleRequest(ServletHandler.java:74)
    at io.undertow.servlet.handlers.security.ServletSecurityRoleHandler.handleRequest(ServletSecurityRoleHandler.java:62)
    at io.undertow.servlet.handlers.ServletChain$1.handleRequest(ServletChain.java:67)
    at io.undertow.servlet.handlers.ServletDispatchingHandler.handleRequest(ServletDispatchingHandler.java:36)
    at org.wildfly.extension.undertow.security.SecurityContextAssociationHandler.handleRequest(SecurityContextAssociationHandler.java:78)
    at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
    at io.undertow.servlet.handlers.security.SSLInformationAssociationHandler.handleRequest(SSLInformationAssociationHandler.java:131)
    at io.undertow.servlet.handlers.security.ServletAuthenticationCallHandler.handleRequest(ServletAuthenticationCallHandler.java:57)
    at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
    at io.undertow.security.handlers.AbstractConfidentialityHandler.handleRequest(AbstractConfidentialityHandler.java:46)
    at io.undertow.servlet.handlers.security.ServletConfidentialityConstraintHandler.handleRequest(ServletConfidentialityConstraintHandler.java:64)
    at io.undertow.security.handlers.AuthenticationMechanismsHandler.handleRequest(AuthenticationMechanismsHandler.java:60)
    at io.undertow.servlet.handlers.security.CachedAuthenticatedSessionHandler.handleRequest(CachedAuthenticatedSessionHandler.java:77)
    at io.undertow.security.handlers.NotificationReceiverHandler.handleRequest(NotificationReceiverHandler.java:50)
    at io.undertow.security.handlers.AbstractSecurityContextAssociationHandler.handleRequest(AbstractSecurityContextAssociationHandler.java:43)
    at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
    at org.wildfly.extension.undertow.security.jacc.JACCContextIdHandler.handleRequest(JACCContextIdHandler.java:61)
    at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
    at org.wildfly.extension.undertow.deployment.GlobalRequestControllerHandler.handleRequest(GlobalRequestControllerHandler.java:68)
    at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
    at io.undertow.servlet.handlers.ServletInitialHandler.handleFirstRequest(ServletInitialHandler.java:292)
    at io.undertow.servlet.handlers.ServletInitialHandler.access$100(ServletInitialHandler.java:81)
    at io.undertow.servlet.handlers.ServletInitialHandler$2.call(ServletInitialHandler.java:138)
    at io.undertow.servlet.handlers.ServletInitialHandler$2.call(ServletInitialHandler.java:135)
    at io.undertow.servlet.core.ServletRequestContextThreadSetupAction$1.call(ServletRequestContextThreadSetupAction.java:48)
    at io.undertow.servlet.core.ContextClassLoaderSetupAction$1.call(ContextClassLoaderSetupAction.java:43)
    at org.wildfly.extension.undertow.security.SecurityContextThreadSetupAction.lambda$create$0(SecurityContextThreadSetupAction.java:105)
    at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1526)
    at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1526)
    at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1526)
    at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1526)
    at io.undertow.servlet.handlers.ServletInitialHandler.dispatchRequest(ServletInitialHandler.java:272)
    at io.undertow.servlet.handlers.ServletInitialHandler.access$000(ServletInitialHandler.java:81)
    at io.undertow.servlet.handlers.ServletInitialHandler$1.handleRequest(ServletInitialHandler.java:104)
    at io.undertow.server.Connectors.executeRootHandler(Connectors.java:360)
    at io.undertow.server.HttpServerExchange$1.run(HttpServerExchange.java:830)
    at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35)
    at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:1985)
    at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1487)
    at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1378)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException: Multi byte glyph name not supported.
    at org.mabb.fontverter.converter.PsType0ToOpenTypeConverter.convertCmap(PsType0ToOpenTypeConverter.java:89)
    at org.mabb.fontverter.converter.PsType0ToOpenTypeConverter.convert(PsType0ToOpenTypeConverter.java:50)
    at org.mabb.fontverter.pdf.PdfFontExtractor.convertType0FontToOpenType(PdfFontExtractor.java:215)
    ... 57 more

I found this issue which was present in Ghostscript regarding glyph widths:

https://github.com/tesseract-ocr/tesseract/issues/712

I'm not sure whether it helps in my current use case, but it also describes selected text highlights being skewed, as in my case. I'm using Ghost4J version 1.0.1, which is equivalent to Ghostscript version 9.25, so the problem described there should already have been fixed.

Please help me with this problem. Thanking you in advance.

EDIT

I am not blaming Ghostscript for the error. But as I found a similar issue to mine while searching, I have provided it here, so that if it indeed points to the root problem, it would be comparatively easy for more learned people to answer my problem.

EDIT

I think my problem can be pinned down to the fact that Tesseract is creating a "glyphless" font for the output PDF, and since it's glyphless, somehow the DOM structure cannot be generated since it does not have a glyph lookup table for the font. I tried searching for how to change the output font, but no luck there. The closest I got was this:

https://unix.stackexchange.com/questions/306051/tesseract-is-it-possible-to-change-font-output-in-ocred-pdf/353191#353191

But I don't know what sort of changes would be required for this to work. This should have been provided by Tesseract as a configurable parameter.

Extracting text from a PDF created with a PDF writer is already a non-trivial undertaking. Having to understand that the text is laid out in tabular form adds another layer of complexity. Having to OCR scanned images and convert them to invisible text in the PDF adds yet another.
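To make the "tabular layout" layer concrete, here is a minimal sketch of one common approach, assuming every word already comes with a position from whatever extraction or OCR step is used. The `TableRows` and `Box` names and the sample coordinates are hypothetical, and the fixed vertical tolerance is an assumption that would need tuning against real scans.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class TableRows {

    /** A word with its left edge (x) and vertical center (y), in any consistent unit. */
    static final class Box {
        final String text;
        final double x;
        final double y;

        Box(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups boxes into rows: a box joins the current row when its vertical
     * center lies within the tolerance of the first box of that row.
     * Rows are ordered by y, cells inside a row by x.
     */
    static List<List<Box>> groupIntoRows(List<Box> boxes, double tolerance) {
        List<Box> sorted = new ArrayList<Box>(boxes);
        sorted.sort(Comparator.comparingDouble((Box b) -> b.y));

        List<List<Box>> rows = new ArrayList<List<Box>>();
        List<Box> current = null;
        double rowY = 0;
        for (Box b : sorted) {
            if (current == null || Math.abs(b.y - rowY) > tolerance) {
                current = new ArrayList<Box>();
                rows.add(current);
                rowY = b.y;
            }
            current.add(b);
        }
        for (List<Box> row : rows) {
            row.sort(Comparator.comparingDouble((Box b) -> b.x));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical word positions, deliberately out of order.
        List<Box> boxes = Arrays.asList(
                new Box("42", 201, 129), new Box("Name", 10, 100),
                new Box("John", 12, 130), new Box("Age", 200, 101));
        for (List<Box> row : groupIntoRows(boxes, 5.0)) {
            StringBuilder line = new StringBuilder();
            for (Box b : row) {
                line.append(b.text).append(" | ");
            }
            System.out.println(line);
        }
    }
}
```

Since the table structure is the same in every PDF, the column a cell belongs to could then be decided by comparing its x position against known column boundaries; merged cells would need extra handling that this sketch does not attempt.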

Perhaps the OCR software has a precision issue with where it places the text on the PDF page relative to where the characters lie in the image once the image is overlaid. This would cause the text highlighting to look amiss. It may be a shortcoming in the software, or you may simply need to adjust some tweakable parameters to fine-tune your OCR results in this case.
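For what it's worth, the placement arithmetic itself is simple, so it can be checked by hand. The sketch below assumes OCR boxes in image pixels with a top-left origin and a PDF page in default user space (72 units per inch, origin at the bottom left). The class name, method name and sample numbers are hypothetical; I am not claiming this is how Tesseract computes positions internally.

```java
import java.awt.geom.Rectangle2D;

/** Converts OCR pixel boxes (top-left origin) to PDF user-space boxes (bottom-left origin). */
class PixelToPdfPoints {

    /** Default PDF user space has 72 units per inch. */
    static final double POINTS_PER_INCH = 72.0;

    /**
     * x0, y0, x1, y1 are the box corners in image pixels with y growing downwards,
     * dpi is the resolution the page image was rendered at, and imageHeightPx is
     * the height of that image in pixels.
     */
    static Rectangle2D.Double toPdfBox(int x0, int y0, int x1, int y1,
                                       double dpi, int imageHeightPx) {
        double scale = POINTS_PER_INCH / dpi;
        double x = x0 * scale;
        double width = (x1 - x0) * scale;
        double height = (y1 - y0) * scale;
        // Flip the y axis: the bottom edge of the pixel box (y1) becomes the box's PDF y.
        double y = (imageHeightPx - y1) * scale;
        return new Rectangle2D.Double(x, y, width, height);
    }

    public static void main(String[] args) {
        // Hypothetical word box on a page image that is 3508 px tall.
        Rectangle2D.Double rendered = toPdfBox(300, 600, 900, 660, 300, 3508);
        // Same pixels, but interpreted with a different (wrong) resolution.
        Rectangle2D.Double mismatched = toPdfBox(300, 600, 900, 660, 72, 3508);
        System.out.println("at 300 dpi: " + rendered);
        System.out.println("at  72 dpi: " + mismatched);
    }
}
```

If the resolution assumed by the OCR step differs from the one used when rendering, every box is scaled by the ratio of the two values. Comparing a few boxes this way is a cheap means of ruling a resolution mismatch in or out before looking at font or glyph-width problems.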

A good test would perhaps be to use a commercial offering, e.g. Adobe Acrobat, to perform OCR on a particular image-only PDF and then check whether it gets the positioning as you would expect or suffers from a similar problem.

As for your precise exception, I tracked it down to the FontVerter Java library (not sure if you are using the library directly), which appears to be a homegrown offering.

Perhaps you could inquire with that company/individual if this is simply a design limitation in their software (I think it is because I'm not clear on why you need to convert font formats in this case).
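If the font conversion turns out to be unavoidable in that library, one possible workaround is to skip the PDF round trip and read word positions from Tesseract's hOCR output instead. As far as I know, Tess4J's `RenderedFormat` also offers `HOCR`, but treat that as something to verify. The sketch below only parses such output with a regular expression. The `HocrWords` class and the sample markup are hypothetical, the attribute order (`class`, `id`, `title`) is an assumption, and HTML entities are not decoded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HocrWords {

    /** One recognized word with its pixel bounding box (top-left origin). */
    static final class Word {
        final String text;
        final int x0, y0, x1, y1;

        Word(String text, int x0, int y0, int x1, int y1) {
            this.text = text;
            this.x0 = x0;
            this.y0 = y0;
            this.x1 = x1;
            this.y1 = y1;
        }

        @Override
        public String toString() {
            return text + " [" + x0 + "," + y0 + " - " + x1 + "," + y1 + "]";
        }
    }

    // Matches <span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>text</span>.
    private static final Pattern WORD = Pattern.compile(
            "<span class=['\"]ocrx_word['\"][^>]*?title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+)"
                    + "[^'\"]*['\"][^>]*>(.*?)</span>",
            Pattern.DOTALL);

    static List<Word> parse(String hocr) {
        List<Word> words = new ArrayList<Word>();
        Matcher m = WORD.matcher(hocr);
        while (m.find()) {
            // Drop nested markup such as <strong> or <em> around the word text.
            String text = m.group(5).replaceAll("<[^>]+>", "").trim();
            if (text.isEmpty()) {
                continue;
            }
            words.add(new Word(text,
                    Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4))));
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical hOCR fragment with two words on one line.
        String sample = "<span class='ocr_line' title='bbox 36 92 500 116'>"
                + "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 96'>Patient</span> "
                + "<span class='ocrx_word' id='word_1_2' title='bbox 110 92 180 116; x_wconf 93'>Name</span>"
                + "</span>";
        for (Word w : parse(sample)) {
            System.out.println(w);
        }
    }
}
```

This gives the text together with its location without involving any font at all, so the "Multi byte glyph name" conversion never runs. A proper XML parser would be more robust than a regular expression for production use.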

Reading checkboxes goes beyond OCR support and into OMR (optical mark recognition) support, which would add yet another layer of complexity to what you're doing today.
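A very rough sketch of what the simplest form of OMR could look like: since the layout is the same in every PDF, the checkbox regions could be hard-coded and the share of dark pixels inside each one measured on the rendered page image. The class name, the luminance threshold and the ratio are all assumptions to be tuned; real scans would likely also need deskewing and noise handling.

```java
import java.awt.image.BufferedImage;

class CheckboxReader {

    /**
     * Fraction of pixels inside the region whose luminance is below the threshold.
     * The region must lie fully inside the image and should be inset from the
     * printed box border, so that the border itself is not counted.
     */
    static double darkRatio(BufferedImage page, int x, int y, int w, int h, int threshold) {
        int dark = 0;
        for (int row = y; row < y + h; row++) {
            for (int col = x; col < x + w; col++) {
                int rgb = page.getRGB(col, row);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int luminance = (r * 299 + g * 587 + b * 114) / 1000;
                if (luminance < threshold) {
                    dark++;
                }
            }
        }
        return (double) dark / (w * h);
    }

    /** Treats the box as ticked when enough of its interior is dark. */
    static boolean isChecked(BufferedImage page, int x, int y, int w, int h,
                             int threshold, double minRatio) {
        return darkRatio(page, x, y, w, h, threshold) > minRatio;
    }

    /** Builds an all-white page, used here only to fabricate test input. */
    static BufferedImage blankPage(int width, int height) {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int row = 0; row < height; row++) {
            for (int col = 0; col < width; col++) {
                img.setRGB(col, row, 0xFFFFFF);
            }
        }
        return img;
    }

    public static void main(String[] args) {
        BufferedImage page = blankPage(100, 100);
        // Draw an "X" inside a hypothetical 20x20 checkbox interior at (10, 10).
        for (int i = 0; i < 20; i++) {
            page.setRGB(10 + i, 10 + i, 0x000000);
            page.setRGB(29 - i, 10 + i, 0x000000);
        }
        System.out.println("ticked box: " + isChecked(page, 10, 10, 20, 20, 128, 0.05));
        System.out.println("empty box:  " + isChecked(page, 50, 10, 20, 20, 128, 0.05));
    }
}
```

In practice the page image would come from the existing Ghostscript rendering step, and the region coordinates would be measured once on a sample page at the same resolution.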
