JAVA - Convert PDF Byte array into Readable String

Question

I am trying to extract the text content from a PDF attached to an email.

I am using the EWS-JAVA-API to get the attachment:

public void getAttachments(Item item) throws Exception {
    EmailMessage message = EmailMessage.bind(service, item.getId(),
            new PropertySet(BasePropertySet.FirstClassProperties, ItemSchema.MimeContent, EmailMessageSchema.Attachments));
    for (Attachment attachment : message.getAttachments()) {
        // Guard the cast: an attachment may also be an item attachment rather than a file.
        if (!(attachment instanceof FileAttachment)) {
            continue;
        }
        FileAttachment newAttachment = (FileAttachment) attachment;
        newAttachment.load();
        newAttachment.getFileName();
        newAttachment.getContentType();
        // Decodes the raw PDF bytes as if they were text.
        System.out.println(new String(newAttachment.getContent()));
    }
}

This, however, prints the raw PDF bytes, e.g.:

"%PDF-1.4
%����
4 0 obj
<<
/Subject (label, DEFAULT format)
/Producer (Apache FOP Version 0.95)
/CreationDate (D:20161015002945+01'00')
>>
endobj
5 0 obj
<<
  /N 3
  /Length 12 0 R
  /Filter /FlateDecode
>>
stream
 ��e����mi ]�P����`/ ���u}q�|^R��,g+���\K�k)/����C_|�R����ax�8�t1C^7nfz�D����p�柇��u�$��/�ED˦L L��[���B�@�������ٹ����ЖX�!@~ (*   {d+��}�G�͋љ���ς�}W�L��$�cGD2�Q���Z4" ...

Above truncated for brevity.

Is there a way of converting this to readable text in code (without writing to disk)?

NOTE: I can create a PDF file from this using PDFBox, but from my understanding that needs to write to disk. I need to do this in memory.

Answer 1

You can try the Apache Tika parser, which can read the PDF from an in-memory stream without writing anything to disk:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>LATEST_VERSION</version>
</dependency>

Example code

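import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

Tika tikaParser = new Tika();
tikaParser.setMaxStringLength(-1); // -1 disables the default limit on extracted text length
Metadata metadata = new Metadata();
InputStream inputStream = new ByteArrayInputStream(newAttachment.getContent());
String content = tikaParser.parseToString(inputStream, metadata);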
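
As for why new String(newAttachment.getContent()) prints garbage: streams marked /Filter /FlateDecode in a PDF are zlib-compressed binary data, so decoding them as text cannot work. The sketch below uses only the JDK to show the effect on a made-up sample string. The class name FlateDemo and the sample bytes are illustrative assumptions, not part of EWS, Tika or the PDF in the question.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class: shows that zlib/deflate data is unreadable until inflated.
class FlateDemo {

    // Compress bytes with zlib/deflate, the scheme behind the /FlateDecode filter.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress zlib/deflate bytes entirely in memory.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and no more input means the data is truncated or invalid.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or invalid deflate data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up sample resembling PDF text-drawing operators.
        byte[] original = "BT /F1 12 Tf (Hello PDF) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] compressed = deflate(original);
        // Compressed bytes decoded as text: unreadable, like the output in the question.
        System.out.println(new String(compressed, StandardCharsets.ISO_8859_1));
        // Inflated bytes: readable again.
        System.out.println(new String(inflate(compressed), StandardCharsets.US_ASCII));
    }
}
```

Even after inflating, a page content stream holds PDF drawing operators rather than plain sentences, and some streams hold fonts, images or colour profiles. Turning that into readable text still needs a real PDF parser; Tika's PDF parser delegates this work to PDFBox internally.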

solution1
0 ACCEPTED 2016-12-21 11:14:07
