
PDFTextStripper parsing with wrong encoding

PDFTextStripper stripper = new PDFText2HTML(encoding);
String result = stripper.getText(document).trim();

The result contains something like:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
 "http://www.w3.org/TR/html4/loose.dtd"> <html><head><title>Inserat
 SeLe EE rev</title> <meta http-equiv="Content-Type"
 content="text/html; charset=utf-8"> </head> <body> <div
 style="page-break-before:always;
 page-break-after:always"><div><p>&#0;&#1;&#2;&#3;&#4;&#5;&#6;&#7;&#...

instead of:

 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
 "http://www.w3.org/TR/html4/loose.dtd"> <html><head><title>Inserat
 SeLe EE rev</title> <meta http-equiv="Content-Type"
 content="text/html; charset=utf-8"> </head> <body> <div
 style="page-break-before:always; page-break-after:always"><div><p>any
 blablabla characters...

When I changed the encoding to windows-1252 or utf-8, the result did not change. The problematic PDF is at http://www.permaco.ch/fileadmin/user_upload/jobs/Inserat_SeLe_EE_rev.pdf

How can I parse this PDF?

"How to parse this pdf?"

Short of OCR'ing it, you don't.

The PDF in question does not contain the information required to extract its text. Recovering the text would take at least some OCR (at a minimum, OCR'ing each glyph of the font used in order to find a mapping from glyph to character), and that would require additional libraries and code.
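To illustrate what "a mapping from glyph to character" would buy you, here is a minimal sketch in plain Java. It assumes some OCR step (not shown, and not part of PDFBox) has already produced a table from character code to text; the class name GlyphMappingDecoder and the sample table entries are hypothetical and are not the real mapping of the PDF in question.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: applies an externally obtained code-to-text table.
final class GlyphMappingDecoder {

    // Decodes raw character codes; unmapped codes become U+FFFD.
    static String decode(int[] codes, Map<Integer, String> codeToText) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String text = codeToText.get(code);
            sb.append(text != null ? text : "\uFFFD");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Invented sample table; a real one would come from OCR'ing each glyph.
        Map<Integer, String> table = new HashMap<>();
        table.put(0, "a");
        table.put(1, "b");
        table.put(2, "c");
        System.out.println(decode(new int[] {0, 1, 2, 3}, table));
    }
}
```

Once such a table exists, extraction is only a lookup; the hard part is building the table, which is exactly the information the font in this PDF fails to provide.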

As a requirement for text extraction, the PDF specification ISO 32000-1:2008 states in section 9.10.2 that the font used for the text to be extracted needs to

  • either contain a ToUnicode CMap (the font used in your document doesn't),
  • or be a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity-H and Identity-V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection (the font used in your document isn't),
  • or be a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font. The font used in your document neither uses one of those predefined encodings, nor are the character names in its Differences array taken from those sets: the names used are /0, /1, ..., /155.
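The three conditions above can be sketched as a small decision function. This is a simplified model in plain Java, not PDFBox API: the FontInfo class, its fields, and the knownGlyphNames parameter are hypothetical stand-ins for what you would read out of the font dictionary, and the tiny glyph-name set in main is only a placeholder for the real Adobe standard Latin and Symbol name lists.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TextExtractabilityCheck {

    // Predefined simple-font encodings named in the list above.
    static final Set<String> PREDEFINED_ENCODINGS = new HashSet<>(Arrays.asList(
            "MacRomanEncoding", "MacExpertEncoding", "WinAnsiEncoding"));

    // Character collections named in the list above.
    static final Set<String> KNOWN_COLLECTIONS = new HashSet<>(Arrays.asList(
            "Adobe-GB1", "Adobe-CNS1", "Adobe-Japan1", "Adobe-Korea1"));

    /** Hypothetical, simplified description of one font; not a PDFBox class. */
    static final class FontInfo {
        boolean hasToUnicode;      // font has a ToUnicode CMap
        boolean composite;         // composite (Type 0) font vs. simple font
        String predefinedCMap;     // name of a predefined CMap if used, else null
        String cidCollection;      // e.g. "Adobe-Japan1", else null
        String baseEncoding;       // e.g. "WinAnsiEncoding", else null
        List<String> differenceNames = Collections.emptyList();
    }

    /** Returns true if one of the three conditions is met (simplified). */
    static boolean isExtractable(FontInfo font, Set<String> knownGlyphNames) {
        if (font.hasToUnicode) {
            return true;
        }
        if (font.composite) {
            boolean usableCMap = font.predefinedCMap != null
                    && !font.predefinedCMap.equals("Identity-H")
                    && !font.predefinedCMap.equals("Identity-V");
            return usableCMap || KNOWN_COLLECTIONS.contains(font.cidCollection);
        }
        // Simple font: predefined encoding, or only well-known Differences names.
        if (font.differenceNames.isEmpty()) {
            return PREDEFINED_ENCODINGS.contains(font.baseEncoding);
        }
        return knownGlyphNames.containsAll(font.differenceNames);
    }

    public static void main(String[] args) {
        // Placeholder for the real standard glyph-name lists (assumption).
        Set<String> known = new HashSet<>(Arrays.asList("A", "B", "space"));

        // Modeled after the font described above: names like /0, /1, ..., /155.
        FontInfo problem = new FontInfo();
        problem.differenceNames = Arrays.asList("0", "1", "155");
        System.out.println(isExtractable(problem, known));
    }
}
```

A font like the one in the question fails all three branches, which is why no choice of output encoding can repair the extracted text.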

Generally, a good first test is to try to copy and paste the text using Adobe Reader, since a lot of text extraction experience has gone into the Reader's code. When you try that with this document, you'll see that you only get garbage.
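A rough programmatic counterpart to the copy-and-paste test is to check how much of the extracted text consists of control characters, which is what the &#0;&#1;&#2;... output in the question amounts to. The following is a minimal sketch in plain Java; the class name ExtractionSanityCheck and the threshold are assumptions, and it expects plain extracted text (for example from PDFTextStripper), not the HTML produced by PDFText2HTML.

```java
// Hypothetical helper for spotting unusable extraction results.
final class ExtractionSanityCheck {

    /** Fraction of code points that are control characters other than tab, LF, CR. */
    static double controlCharRatio(String text) {
        int total = 0;
        int control = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            boolean whitespaceControl = cp == '\t' || cp == '\n' || cp == '\r';
            if (Character.isISOControl(cp) && !whitespaceControl) {
                control++;
            }
        }
        return total == 0 ? 0.0 : (double) control / total;
    }

    /** True if more than the given fraction of the text is control characters. */
    static boolean looksLikeGarbage(String text, double threshold) {
        return controlCharRatio(text) > threshold;
    }

    public static void main(String[] args) {
        String bad = "\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007";
        String good = "any blablabla characters";
        // The threshold 0.3 is an arbitrary assumption for this sketch.
        System.out.println(looksLikeGarbage(bad, 0.3));
        System.out.println(looksLikeGarbage(good, 0.3));
    }
}
```

A check like this only tells you that extraction failed; it cannot recover the text, for the reasons given above.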
