While parsing a PDF document with org.apache.pdfbox in Java, '-' is converted to '?'
PDDocument pdfDoc = PDDocument.load(input);
PDFTextStripper stripper=new PDFTextStripper();
stripper.setSortByPosition(true);
stripper.fixedCharWidth =5;
String text= stripper.getText(pdfDoc);
The input string in the PDF file is: 07‑Jul‑2014 / 7/ 2014
The extracted output for that line looks like this: 07?JUL?2014 / 7/ 2014
Here is a workaround (it works for me, at least):
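String DD_MM_YYYY_DATE_FORMAT_REGEX = "[0-9]{1,2}(\\?)[0-9]{1,2}(\\?)[0-9]{4}"; // numeric month, e.g. 07?07?2014, length 10
String DD_MMM_YYYY_DATE_FORMAT_REGEX = "[0-9]{1,2}(\\?)[a-zA-Z]{3}(\\?)[0-9]{4}"; // month name, e.g. 07?Jul?2014, length 11
String word = wordArray[index];
// Check the length first so substring() cannot throw StringIndexOutOfBoundsException.
if (word.length() >= 11 && word.substring(0, 11).matches(DD_MMM_YYYY_DATE_FORMAT_REGEX)) {
    // Replace every '?' in the word with '/'.
    wordArray[index] = word.replaceAll("\\?", "/");
} else if (word.length() >= 10 && word.substring(0, 10).matches(DD_MM_YYYY_DATE_FORMAT_REGEX)) {
    wordArray[index] = word.replaceAll("\\?", "/");
}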
It looks like an encoding issue. Since the PDF hasn't been shared, I can only suggest trying the following:
PDFTextStripper stripper=new PDFTextStripper("UTF-8");
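To illustrate why an encoding problem would produce '?', here is a minimal sketch. It assumes the separator in the PDF is a non-ASCII hyphen such as U+2011 (NON-BREAKING HYPHEN), which the sample text in the question suggests but which cannot be confirmed without the file. The class and method names (HyphenNormalizer, normalizeHyphens, toLatin1Lossy) are made up for this example and are not part of PDFBox.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFBox.
class HyphenNormalizer {

    // Replace every Unicode dash-punctuation character (category Pd,
    // which includes U+2010, U+2011 and U+2013) with an ASCII '-'.
    static String normalizeHyphens(String text) {
        return text.replaceAll("\\p{Pd}", "-");
    }

    // Round-trip the text through ISO-8859-1. Characters that the charset
    // cannot represent are replaced with '?', which mimics the symptom.
    static String toLatin1Lossy(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Assumed content of the extracted text: U+2011 between the date parts.
        String extracted = "07\u2011Jul\u20112014 / 7/ 2014";

        System.out.println(toLatin1Lossy(extracted));     // 07?Jul?2014 / 7/ 2014
        System.out.println(normalizeHyphens(extracted));  // 07-Jul-2014 / 7/ 2014
    }
}
```

If this assumption holds, normalizing the text right after stripper.getText(pdfDoc), or writing the output in UTF-8, may be simpler than pattern-matching on '?' afterwards, because the original character is still available at that point.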