Apache Tika returning an empty string for PDF files (Java)

Question

I am trying to extract the content of documents using the Apache Tika tika() function. It works for .doc and .docx files, but not for .pdf files. I have not specified the document type anywhere in the code, so I don't know why only .pdf files fail.

Here is my code, from the extractDocument function:

    int indexedChars = -1; // -1: no limit on the number of extracted characters
    Metadata metadata = new Metadata();
    int experiance = 0;
    // Decode the Base64 content and let Tika detect the document type itself
    String parsedContent = tika().parseToString(
            new BytesStreamInput(Base64.decode(document.getContent().getBytes()), false),
            metadata, indexedChars);
    System.out.println("parsedContent " + parsedContent);

Here parsedContent comes back as an empty string. This is the function that calls extractDocument:

    public Document push(Document document, String userName, HttpServletRequest req) {

        // Check for null first: the debug call below dereferences document
        if (document == null)
            return null;
        if (logger.isDebugEnabled()) logger.debug("push({})", document.getContent());
        System.out.println("document.getContent() is " + document.getContent());

        /*
        if (document.getIndex() == null || document.getIndex().isEmpty()) {
            document.setIndex(SMDSearchProperties.INDEX_NAME);
        }
        if (document.getType() == null || document.getType().isEmpty()) {
            document.setType(SMDSearchProperties.INDEX_TYPE_DOC);
        }
         */
        getNodeClient(userName);
        try {

            System.out.println("client is " + userName);
            // Index the extracted document under an index named after the user
            IndexResponse response = client
                    .prepareIndex(userName, document.getType(),
                            document.getId())
                    .setSource(extractDocument(document)).execute()
                    .actionGet();
            document.setId(response.getId());
        } catch (Exception e) {
            e.printStackTrace();
            logger.warn("Can not index document {}", document.getName());
            // println does not expand {} placeholders, so build the message by concatenation
            System.out.println("Can not index document " + document.getName() + " e.getMessage() " + e.getMessage());
            //throw new RestAPIException("Can not index document : "+ document.getName() + ": "+e.getMessage());
        }
        if (logger.isDebugEnabled()) logger.debug("/push()={}", document);
        return document;
    }

Answer 1

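I found the solution in this thread:

Error while parsing Binary Files... (mostly PDF)

Tika relies on PDFBox to parse PDF files, and in my case those jars were not in the project. Download these three jar files, copy them to your lib folder, and add them to the project's classpath:

fontbox-1.5.0.jar
jempbox-1.5.0.jar
pdfbox-1.5.0.jar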
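
To confirm the jars are actually visible at runtime, a quick classpath probe can help. The following is only a minimal sketch: the class PdfClasspathCheck and its methods are hypothetical helpers, and the three probed class names are my assumption of one representative class per jar, so adjust them if your versions differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: reports which PDF-related jars seem to be missing from the classpath. */
final class PdfClasspathCheck {

    // One representative class per jar. These class names are assumptions
    // chosen for illustration; they are not taken from the original post.
    static final Map<String, String> PROBES = new LinkedHashMap<>();
    static {
        PROBES.put("pdfbox", "org.apache.pdfbox.pdmodel.PDDocument");
        PROBES.put("fontbox", "org.apache.fontbox.cmap.CMap");
        PROBES.put("jempbox", "org.apache.jempbox.xmp.XMPMetadata");
    }

    /** Returns true if the class can be located, without initializing it. */
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, PdfClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns the keys (jar names) whose probe class could not be found. */
    static List<String> missing(Map<String, String> probes) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> probe : probes.entrySet()) {
            if (!isPresent(probe.getValue())) {
                result.add(probe.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> missing = missing(PROBES);
        if (missing.isEmpty()) {
            System.out.println("All PDF parsing libraries were found on the classpath");
        } else {
            System.out.println("Missing from the classpath: " + missing);
        }
    }
}
```

Run it with the same classpath as the application (for a web application, call missing(PROBES) from inside the application instead, since the container uses its own class loaders). If any jar is reported missing, Tika has no way to parse PDF files, which would be consistent with the empty string seen in the question.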
