简体   繁体   中英

Read Content from Files which are inside Zip file

I am trying to create a simple java program which reads and extracts the content from the file(s) inside zip file. Zip file contains 3 files (txt, pdf, docx). I need to read the contents of all these files and I am using Apache Tika for this purpose.

Can somebody help me out here to achieve the functionality. I have tried this so far but no success

Code Snippet

public class SampleZipExtract {


    public static void main(String[] args) {

        List<String> tempString = new ArrayList<String>();
        StringBuffer sbf = new StringBuffer();

        File file = new File("C:\\Users\\xxx\\Desktop\\abc.zip");
        InputStream input;
        try {

          input = new FileInputStream(file);
          ZipInputStream zip = new ZipInputStream(input);
          ZipEntry entry = zip.getNextEntry();

          BodyContentHandler textHandler = new BodyContentHandler();
          Metadata metadata = new Metadata();

          Parser parser = new AutoDetectParser();

          while (entry!= null){

                if(entry.getName().endsWith(".txt") || 
                           entry.getName().endsWith(".pdf")||
                           entry.getName().endsWith(".docx")){
              System.out.println("entry=" + entry.getName() + " " + entry.getSize());
                     parser.parse(input, textHandler, metadata, new ParseContext());
                     tempString.add(textHandler.toString());
                }
           }
           zip.close();
           input.close();

           for (String text : tempString) {
           System.out.println("Apache Tika - Converted input string : " + text);
           sbf.append(text);
           System.out.println("Final text from all the three files " + sbf.toString());
        } catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (SAXException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (TikaException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}

If you're wondering how to get the file content from each ZipEntry it's actually quite simple. Here's a sample code:

public static void main(String[] args) throws IOException {
    ZipFile zipFile = new ZipFile("C:/test.zip");

    Enumeration<? extends ZipEntry> entries = zipFile.entries();

    while(entries.hasMoreElements()){
        ZipEntry entry = entries.nextElement();
        InputStream stream = zipFile.getInputStream(entry);
    }
}

Once you have the InputStream you can read it however you want.

As of Java 7, the NIO АР I provides a better and more generic way of accessing the contents of ZIP or JAR files. Actually, it is now a unified API which allows you to treat ZIP files exactly like normal files.

In order to extract all of the files contained inside of a ZIP file in this API, you'd do as shown below.

In Java 8

private void extractAll(URI fromZip, Path toDirectory) throws IOException {
    FileSystems.newFileSystem(fromZip, Collections.emptyMap())
            .getRootDirectories()
            .forEach(root -> {
                // in a full implementation, you'd have to
                // handle directories 
                Files.walk(root).forEach(path -> Files.copy(path, toDirectory));
            });
}

In Java 7

private void extractAll(URI fromZip, Path toDirectory) throws IOException {
    FileSystem zipFs = FileSystems.newFileSystem(fromZip, Collections.emptyMap());

    for (Path root : zipFs.getRootDirectories()) {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) 
                    throws IOException {
                // You can do anything you want with the path here
                Files.copy(file, toDirectory);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) 
                    throws IOException {
                // In a full implementation, you'd need to create each 
                // sub-directory of the destination directory before 
                // copying files into it
                return super.preVisitDirectory(dir, attrs);
            }
        });
    }
}

Because of the condition in while , the loop might never break:

while (entry != null) {
  // If entry never becomes null here, loop will never break.
}

Instead of the null check there, you can try this:

ZipEntry entry = null;
while ((entry = zip.getNextEntry()) != null) {
  // Rest of your code
}

Sample code you can use to let Tika take care of container files for you. http://wiki.apache.org/tika/RecursiveMetadata

Form what I can tell, the accepted solution will not work for cases where there are nested zip files. Tika, however will take care of such situations as well.

My way of achieving this is by creating ZipInputStream wrapping class that would handle that would provide only the stream of current entry:

The wrapper class:

public class ZippedFileInputStream extends InputStream {

    private ZipInputStream is;

    public ZippedFileInputStream(ZipInputStream is){
        this.is = is;
    }

    @Override
    public int read() throws IOException {
        return is.read();
    }

    @Override
    public void close() throws IOException {
        is.closeEntry();
    }

}

The use of it:

    ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream("SomeFile.zip"));

    while((entry = zipInputStream.getNextEntry())!= null) {

     ZippedFileInputStream archivedFileInputStream = new ZippedFileInputStream(zipInputStream);

     //... perform whatever logic you want here with ZippedFileInputStream 

     // note that this will only close the current entry stream and not the ZipInputStream
     archivedFileInputStream.close();

    }
    zipInputStream.close();

One advantage of this approach: InputStreams are passed as an arguments to methods that process them and those methods have a tendency to immediately close the input stream after they are done with it.

i did mine like this and remember to change url or zip files jdk 15

import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Scanner;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.io.*;
import java.util.*;
import java.nio.file.Paths;

class Main {
  public static void main(String[] args) throws MalformedURLException,FileNotFoundException,IOException{
    String url,kfile;
    Scanner getkw = new Scanner(System.in);
    System.out.println(" Please Paste Url ::");
    url = getkw.nextLine();
    System.out.println("Please enter name of file you want to save as :: ");
    kfile = getkw.nextLine();
    getkw.close();
    Main Dinit = new Main();
    System.out.println(Dinit.dloader(url, kfile));
    ZipFile Vanilla = new ZipFile(new File("Vanilla.zip"));
    Enumeration<? extends ZipEntry> entries = Vanilla.entries();

    while(entries.hasMoreElements()){
        ZipEntry entry = entries.nextElement();
//        String nextr =  entries.nextElement();
        InputStream stream = Vanilla.getInputStream(entry);
        FileInputStream inpure= new FileInputStream("Vanilla.zip");
        FileOutputStream outter = new FileOutputStream(new File(entry.toString()));
        outter.write(inpure.readAllBytes());
        outter.close();
    }

  }
  private String dloader(String kurl, String fname)throws IOException{
    String status ="";
    try {
      URL url = new URL("URL here");
      FileOutputStream out = new FileOutputStream(new File("Vanilla.zip"));         // Output File
      out.write(url.openStream().readAllBytes());
      out.close();
    } catch (MalformedURLException e) {
      status = "Status: MalformedURLException Occured";
    }catch (IOException e) {
      status = "Status: IOexception Occured";
    }finally{
      status = "Status: Good";}
    String path="\\tkwgter5834\\";
    extractor(fname,"tkwgter5834",path);
    

    return status;
  }
  private String extractor(String fname,String dir,String path){
    File folder = new File(dir);
    if(!folder.exists()){
      folder.mkdir();
    }
    return "";
  }
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM