
Error when extracting text from pdf using pdfbox

sample pdf

The sample PDF is a three-page Chinese resume. I extract the text with the standard code below:

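// Needs java.io.File, org.apache.pdfbox.pdmodel.PDDocument, org.apache.pdfbox.text.PDFTextStripper
try (PDDocument document = PDDocument.load(new File(path))) { // closes the document when done
    PDFTextStripper stripper = new PDFTextStripper();
    text = stripper.getText(document);
}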

The extraction result, shown in the image below, contains only some of the words.

(Image: extraction result)

If you run the text extraction code with logging enabled, you'll see numerous warnings:

Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+5482 (5482) in font GNPVNR+PingFangSC-Semibold
Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+1842 (1842) in font GNPVNR+PingFangSC-Semibold
Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+7566 (7566) in font GNPVNR+PingFangSC-Semibold
Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+1915 (1915) in font GNPVNR+PingFangSC-Semibold
...
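To get a feel for how widespread the problem is, one could tally these warnings per font. The following is only a minimal sketch that parses captured log lines with the standard library; the class and method names are hypothetical, and it assumes the warning text has exactly the shape shown above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: counts "No Unicode mapping" warnings per font.
final class UnicodeWarningTally {
    // Assumes the warning format shown in the log excerpt above.
    private static final Pattern WARNING = Pattern.compile(
            "No Unicode mapping for CID\\+\\d+ \\(\\d+\\) in font (?<font>\\S+)");

    /** Returns a map from font name to the number of warnings logged for it. */
    static Map<String, Integer> countByFont(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = WARNING.matcher(line);
            if (m.find()) {
                counts.merge(m.group("font"), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode",
                "WARN: No Unicode mapping for CID+5482 (5482) in font GNPVNR+PingFangSC-Semibold",
                "WARN: No Unicode mapping for CID+1842 (1842) in font GNPVNR+PingFangSC-Semibold");
        System.out.println(countByFont(sample));
    }
}
```

If nearly every glyph of a font shows up in such a tally, the font most likely carries no usable Unicode mapping at all.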

Indeed, inspecting the PDF shows that numerous subsets of PingFangSC styles are embedded, but each of them comes

  • with a ToUnicode map without any entries at all,
  • with an Identity-H encoding, and
  • with an Adobe-Identity-0 ROS,

i.e. without any information about which glyph represents which Unicode code point. Thus, it is no surprise that the text extraction result is missing most of the text.
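To illustrate why an empty ToUnicode map is fatal in this setup, here is a toy model of the lookup. It is not PDFBox's actual implementation; the class and method names are hypothetical, and the CID-to-character mappings in the example are made up. In the model, Identity-H means each two-byte code is used directly as the CID, and the only way from a CID to a character is the ToUnicode map.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: a plain Map stands in for the font's ToUnicode CMap.
final class ToUnicodeSketch {
    /**
     * Decodes a sequence of two-byte codes. Each code is taken as the CID
     * (as with Identity-H) and looked up in the map; unmapped CIDs are skipped.
     */
    static String decode(byte[] codes, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i + 1 < codes.length; i += 2) {
            int cid = ((codes[i] & 0xFF) << 8) | (codes[i + 1] & 0xFF);
            String unicode = toUnicode.get(cid);
            if (unicode != null) {
                out.append(unicode);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] codes = {0x15, 0x6A, 0x07, 0x32}; // CIDs 5482 and 1842

        // Empty map, as in the sample PDF: nothing can be decoded.
        System.out.println("[" + decode(codes, new HashMap<Integer, String>()) + "]");

        // Made-up entries, purely to show what a populated map would allow.
        Map<Integer, String> populated = new HashMap<>();
        populated.put(5482, "\u7B80");
        populated.put(1842, "\u5386");
        System.out.println("[" + decode(codes, populated) + "]");
    }
}
```

With the empty map the decoded string is empty, while the populated map yields two characters for the same bytes, which is the difference between the sample PDF and one that carries the required information.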

So if you really need to extract the text, ask the source of the PDF to provide a copy that includes the required information. If that is not possible, try OCR.


By the way, a good first check is usually to try copying and pasting the text from Adobe Reader. In the case at hand, that also results in mostly missing characters, which usually means that the information required for text extraction according to the PDF specification is missing.

You can find more background at the link @Tilman provided in a comment: https://pdfbox.apache.org/2.0/faq.html#text-extraction
