PDFTextStripper parsing with wrong encoding
PDFTextStripper stripper = new PDFText2HTML(encoding);
String result = stripper.getText(document).trim();
The result contains something like:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd"> <html><head><title>Inserat
SeLe EE rev</title> <meta http-equiv="Content-Type"
content="text/html; charset=utf-8"> </head> <body> <div
style="page-break-before:always;
page-break-after:always"><div><p>�&#...
instead of:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd"> <html><head><title>Inserat
SeLe EE rev</title> <meta http-equiv="Content-Type"
content="text/html; charset=utf-8"> </head> <body> <div
style="page-break-before:always; page-break-after:always"><div><p>any
blablabla characters...
When I changed the encoding to windows-1252 or utf-8, the result did not change. The problematic PDF is at http://www.permaco.ch/fileadmin/user_upload/jobs/Inserat_SeLe_EE_rev.pdf
How can I parse this PDF?
Short of OCR'ing it, you don't.
The PDF in question does not contain the information required to extract text without at least some OCR. At a minimum, you would have to OCR each glyph of the font used in order to build a mapping from glyph to character, and that requires additional libraries and code.
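To make the glyph-to-character mapping idea concrete, here is a minimal sketch in plain Java. It assumes you have already obtained the raw character codes of a text run and built a code-to-text table by some other means (for example by OCR'ing each glyph); the class name `GlyphMapDecoder` and the table contents are hypothetical and not part of PDFBox.

```java
import java.util.Map;

/** Hypothetical helper: rebuilds text from raw character codes using a hand-built table. */
final class GlyphMapDecoder {
    private GlyphMapDecoder() {}

    /**
     * Translates each character code through codeToText.
     * Codes without an entry become U+FFFD so gaps in the table stay visible.
     */
    static String decode(int[] codes, Map<Integer, String> codeToText) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            // Assumption: one code corresponds to one glyph of the font in question.
            out.append(codeToText.getOrDefault(code, "\uFFFD"));
        }
        return out.toString();
    }
}
```

With a table that maps 1 to "J", 2 to "o" and 3 to "b", calling `GlyphMapDecoder.decode(new int[] {1, 2, 3}, table)` yields "Job". Building that table is the hard part, and it is exactly the information this PDF does not supply.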
As a requirement for text extraction, the PDF specification ISO 32000-1:2008 states in section 9.10.2 what the font used for the text to be extracted needs to provide.
Generally, a good first test is to copy and paste the text using Adobe Reader, since a lot of text extraction experience has gone into the Reader's code. When you try that with this document, you only get garbage.
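If you need to run a similar sanity check programmatically on many files, one rough option is to measure how much of the extracted plain text consists of characters that rarely appear in real text. The following is only a heuristic sketch: the class name `ExtractionSanityCheck` and the threshold are assumptions, and it cannot detect a font that maps its glyphs to the wrong but otherwise ordinary letters.

```java
/** Hypothetical heuristic for spotting unusable text extraction results. */
final class ExtractionSanityCheck {
    private ExtractionSanityCheck() {}

    /** Returns the fraction of code points that look like extraction debris. */
    static double suspiciousRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            // Control characters other than ordinary whitespace (tab, newline, ...).
            boolean control = Character.isISOControl(cp) && !Character.isWhitespace(cp);
            boolean privateUse = Character.getType(cp) == Character.PRIVATE_USE;
            boolean unassigned = !Character.isDefined(cp);
            if (cp == 0xFFFD || control || privateUse || unassigned) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    /** The threshold is an arbitrary assumption; tune it on your own documents. */
    static boolean looksLikeGarbage(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }
}
```

Apply this to plain text output rather than to HTML output, because HTML output may write non-ASCII characters as numeric entities, which would hide them from the check.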