I am trying to extract the content of documents using Apache Tika. I can get the contents of .doc and .docx files, but it does not work for .pdf files. I did not specify a document type anywhere in the code, so I don't understand why it fails only for .pdf files.
Here is my code:
In extractDocument function:
int indexedChars = -1;
Metadata metadata = new Metadata();
int experiance=0;
String parsedContent;
parsedContent = tika().parseToString(
        new BytesStreamInput(Base64.decode(document.getContent().getBytes()), false),
        metadata, indexedChars);
System.out.println("parsedContent "+parsedContent);
Here parsedContent comes back as an empty string. This is the function that calls extractDocument:
public Document push(Document document, String userName, HttpServletRequest req) {
    // Check for null before touching the document; the debug call below dereferences it.
    if (document == null)
        return null;
    if (logger.isDebugEnabled()) logger.debug("push({})", document.getContent());
    System.out.println("document.getContent() is " + document.getContent());
    /*
    if (document.getIndex() == null || document.getIndex().isEmpty()) {
        document.setIndex(SMDSearchProperties.INDEX_NAME);
    }
    if (document.getType() == null || document.getType().isEmpty()) {
        document.setType(SMDSearchProperties.INDEX_TYPE_DOC);
    }
    */
    getNodeClient(userName);
    try {
        System.out.println("client is " + userName);
        // Index the extracted document into the index named after the user.
        IndexResponse response = client
                .prepareIndex(userName, document.getType(), document.getId())
                .setSource(extractDocument(document)).execute()
                .actionGet();
        document.setId(response.getId());
    } catch (Exception e) {
        e.printStackTrace();
        logger.warn("Can not index document {}", document.getName());
        System.out.println("Can not index document " + document.getName() + " e.getMessage() " + e.getMessage());
        //throw new RestAPIException("Can not index document : "+ document.getName() + ": "+e.getMessage());
    }
    if (logger.isDebugEnabled()) logger.debug("/push()={}", document);
    return document;
}
I found the solution in this thread:
Error while parsing Binary Files... (mostly PDF)
Tika's PDF parser relies on PDFBox, and those jars were missing from my classpath. Download these three jar files, copy them to your lib folder, and add them to the project:
fontbox-1.5.0.jar
jempbox-1.5.0.jar
pdfbox-1.5.0.jar
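To confirm that the three jars are actually visible to the application, a quick classpath check can help. The following is only a minimal sketch: the class name PdfBoxClasspathCheck and the helper isClassAvailable are made up for illustration, and the three probed class names are my assumption of one representative class from each jar, so adjust them if your versions differ.

```java
public class PdfBoxClasspathCheck {

    // Hypothetical helper: returns true if the named class can be located
    // by this class's class loader, without initializing it.
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, PdfBoxClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed representative classes, one per jar listed above.
        String[] probes = {
            "org.apache.pdfbox.pdmodel.PDDocument", // pdfbox
            "org.apache.fontbox.ttf.TTFParser",     // fontbox
            "org.apache.jempbox.xmp.XMPMetadata"    // jempbox
        };
        for (String name : probes) {
            System.out.println(name + " -> " + (isClassAvailable(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same classpath as the application. Any line reporting MISSING suggests the corresponding jar is not being picked up, which would be consistent with PDF parsing producing no text.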