
Reading bytes from compressed PDF file in Java

When I read a normal PDF file into a byte array in Java (here with Files.readAllBytes), the array is filled correctly and its size matches that of the original PDF file.

Path file_path = Paths.get("D:\\Zip Test Client", "vadClient1.pdf");
byte[] ByteArray = Files.readAllBytes(file_path);
FileOutputStream fos = new FileOutputStream(new File("E:\\newFinalPDF.pdf"));
fos.write(ByteArray);
fos.close();

But when I read the same PDF file from inside a zip archive, the read function fills only the first 8843 bytes (the original size is 194471) and the rest of the array stays 0.

zipFile = new ZipFile(new File("D:\\Zip test Server\\ZipTestFolderOnServer.zip"));
long count = zipFile.size();
Enumeration<? extends ZipEntry> entries = zipFile.entries();
while (entries.hasMoreElements()) {
    System.out.println("New File starting");
    ZipEntry zipEntry = entries.nextElement();
    System.out.println(zipEntry.getName());
    InputStream fis = zipFile.getInputStream(zipEntry);
    byte[] fileToBytes = new byte[(int) zipEntry.getSize()];

    FileOutputStream fos = new FileOutputStream(new File("E:\\ContentZipped_" + zipEntry.getName()));
    fis.read(fileToBytes);
    fos.write(fileToBytes);
    fis.close();
    Thread.sleep(1000);
    --count;
}

What explains this behavior?

EDIT 1: I am not looking for third-party libraries such as Tika or POI.

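A single call to InputStream.read(byte[]) is not guaranteed to fill the array: it returns the number of bytes it actually read, which can be smaller than the array length. Let's make it less error-prone (and less memory-hungry) by simplifying the code. Use this to copy the content of your zip entry: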

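// Requires java.nio.file.StandardCopyOption; REPLACE_EXISTING overwrites a file left by an earlier run.
try (InputStream fis = zipFile.getInputStream(zipEntry)) {
    Files.copy(fis, Paths.get("E:\\ContentZipped_" + zipEntry.getName()),
            StandardCopyOption.REPLACE_EXISTING);
}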
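If the bytes are needed in memory rather than on disk, the usual pattern is to keep calling read until the buffer is full or the stream reports end of data. The following is a minimal sketch of that idea, not the original poster's code: the class name ZipReadDemo, the helper readFully, and the throwaway zip with a 200,000-byte entry are all assumptions made for illustration. The number of bytes returned by the single read call depends on the implementation, so no particular value should be expected.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

class ZipReadDemo {

    /** Keeps calling read() until the buffer is full or the stream ends; returns the bytes read. */
    static int readFully(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) {
                break; // end of stream reached before the buffer was filled
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway zip with one 200,000-byte entry (hypothetical test data).
        byte[] payload = new byte[200_000];
        new Random(42).nextBytes(payload);
        Path zipPath = Files.createTempFile("demo", ".zip");
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zipPath))) {
            zos.putNextEntry(new ZipEntry("sample.bin"));
            zos.write(payload);
            zos.closeEntry();
        }

        try (ZipFile zipFile = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zipFile.getEntry("sample.bin");
            byte[] single = new byte[(int) entry.getSize()];
            byte[] looped = new byte[(int) entry.getSize()];
            try (InputStream in = zipFile.getInputStream(entry)) {
                // One call: may stop well short of the buffer length.
                System.out.println("single read(): " + in.read(single) + " bytes");
            }
            try (InputStream in = zipFile.getInputStream(entry)) {
                // Looping read: continues until the whole entry has been decompressed.
                System.out.println("readFully():   " + readFully(in, looped) + " bytes");
            }
        } finally {
            Files.deleteIfExists(zipPath);
        }
    }
}
```

On Java 9 and later, InputStream.readAllBytes() does the same looping internally and returns a correctly sized array, so a hand-written helper like this is mainly useful on older runtimes or when reading into a preallocated buffer.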

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class SampleZipExtract {

    public static void main(String[] args) {
        List<String> tempString = new ArrayList<>();
        StringBuilder sbf = new StringBuilder();

        File file = new File("C:\\Users\\xxx\\Desktop\\abc.zip");
        Parser parser = new AutoDetectParser();

        // try-with-resources closes the zip stream and the file stream underneath it.
        try (ZipInputStream zip = new ZipInputStream(new FileInputStream(file))) {
            ZipEntry entry;
            // getNextEntry() positions the stream at the next entry and returns null at the end.
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.endsWith(".txt") || name.endsWith(".pdf") || name.endsWith(".docx")) {
                    System.out.println("entry=" + name + " " + entry.getSize());
                    // A fresh handler per entry keeps each file's text separate.
                    BodyContentHandler textHandler = new BodyContentHandler();
                    Metadata metadata = new Metadata();
                    // Parse the zip stream (the current entry's data), not the raw file stream.
                    parser.parse(zip, textHandler, metadata, new ParseContext());
                    tempString.add(textHandler.toString());
                }
            }
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }

        for (String text : tempString) {
            System.out.println("Apache Tika - Converted input string : " + text);
            sbf.append(text);
        }
        System.out.println("Final text from all the files " + sbf);
    }
}
