Extract text from a PDF with PDFBox
I have two similar sample tickets. PDFBox reads one of them horizontally and the other vertically.
The result for the 1st image is:
BOOKING ID : BBT3001402
HI ! YOUR BOOKING AT MATHURA EXECUTIVE IS CONFIRMED!
CHECK IN
31
JANUARY
FRIDAY
NIGHTS
4N
CHECK OUT
4
FEBRUARY
TUESDAY
BOOKING DETAILS:
The result for the 2nd image is:
BOOKING ID : BBT2601540
HI ! YOUR BOOKING AT VIVANTA BENGALURU, RESIDENCY ROAD IS CONFIRMED!
CHECK IN NIGHTS CHECK OUT
27 7N 03
JANUARY FEBRUARY
WEDNESDAY WEDNESDAY
BOOKING DETAILS:
I want PDFBox to read the data in one fixed format (either horizontal or vertical).
PDFBox is for PDF manipulation and does not do OCR out of the box; for that you need something like Apache Tika or Tesseract OCR.
If the PDF already has text in it, you can extract it like this:
// doc is assumed to be an already-loaded PDDocument
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1); // page numbers are 1-based
stripper.setEndPage(1);   // stop after the first page
String extractedText = stripper.getText(doc);
System.out.println(extractedText);
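On the fixed-format part of the question: PDFTextStripper also has a setSortByPosition(boolean) setter. Calling stripper.setSortByPosition(true) before getText asks it to order text by its position on the page rather than by the order in which it is stored in the file. Whether that is enough for these two tickets is untested here. The sketch below is not PDFBox code; it only illustrates the idea of position-based ordering in plain Java. The Fragment class, the toLines method, the coordinates, and the tolerance value are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionSortSketch {

    /** Hypothetical stand-in for a positioned text run; not a PDFBox type. */
    public static final class Fragment {
        final String text;
        final float x; // distance from the left edge
        final float y; // distance from the top edge (grows downward in this sketch)

        public Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Groups fragments into rows by y (top to bottom), then orders each row left to right. */
    public static String toLines(List<Fragment> fragments, float yTolerance) {
        List<Fragment> byY = new ArrayList<>(fragments);
        byY.sort((a, b) -> Float.compare(a.y, b.y));

        List<List<Fragment>> rows = new ArrayList<>();
        for (Fragment f : byY) {
            List<Fragment> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            // Same row if y is within the tolerance of the row's first fragment.
            if (last != null && Math.abs(f.y - last.get(0).y) <= yTolerance) {
                last.add(f);
            } else {
                List<Fragment> row = new ArrayList<>();
                row.add(f);
                rows.add(row);
            }
        }

        StringBuilder out = new StringBuilder();
        for (List<Fragment> row : rows) {
            row.sort((a, b) -> Float.compare(a.x, b.x));
            List<String> words = new ArrayList<>();
            for (Fragment f : row) {
                words.add(f.text);
            }
            out.append(String.join(" ", words)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Fragments listed column by column, as in the 1st ticket's output.
        List<Fragment> fragments = List.of(
            new Fragment("CHECK IN", 10, 100), new Fragment("31", 10, 120),
            new Fragment("NIGHTS", 200, 100), new Fragment("4N", 200, 120),
            new Fragment("CHECK OUT", 400, 100), new Fragment("4", 400, 120));
        System.out.print(toLines(fragments, 2f));
    }
}
```

With the made-up coordinates in main, the column-ordered input is printed row by row, which is the shape of the 2nd ticket's output.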