
How to unzip files stored in HDFS using Spark (Java)

List<String> list = jsc.wholeTextFiles(hdfsPath).keys().collect();
for (String string : list) {
    System.out.println(string);
}

Here I am getting all the zip file names. From here I am unable to proceed: how do I extract each file and store it in an HDFS path under a folder with the same name as the zip?

With gzip files, wholeTextFiles should gunzip everything automatically. With zip files, however, the only way I know of is to use binaryFiles and unzip the data by hand.

import java.util.Scanner
import java.util.zip.{ZipEntry, ZipInputStream}

sc
  .binaryFiles(hdfsDir)
  .mapValues(x => {
    val result = scala.collection.mutable.ArrayBuffer.empty[String]
    val zis = new ZipInputStream(x.open())
    var entry: ZipEntry = null
    while ({entry = zis.getNextEntry(); entry} != null) {
      // read the current entry line by line
      val scanner = new Scanner(zis)
      while (scanner.hasNextLine()) { result += scanner.nextLine() }
    }
    zis.close()
    result
  })

This gives you a pair RDD[(String, ArrayBuffer[String])] where the key is the name of the file on HDFS and the value is the unzipped content of the zip file (one line per element of the ArrayBuffer). If a given zip file contains more than one file, everything is aggregated. You may adapt the code to fit your exact needs. For instance, flatMapValues instead of mapValues would flatten everything into an RDD[(String, String)] to take advantage of Spark's parallelism.
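For example, here is a minimal sketch of the flatMapValues variant (assuming the same imports and the same hdfsDir as in the snippet above):

// Sketch only: same unzip logic, but flatMapValues flattens the ArrayBuffer,
// so each line of each zip entry becomes its own (fileName, line) element.
val lines = sc
  .binaryFiles(hdfsDir)
  .flatMapValues(x => {
    val result = scala.collection.mutable.ArrayBuffer.empty[String]
    val zis = new ZipInputStream(x.open())
    var entry: ZipEntry = null
    while ({entry = zis.getNextEntry(); entry} != null) {
      val scanner = new Scanner(zis)
      while (scanner.hasNextLine()) { result += scanner.nextLine() }
    }
    zis.close()
    result
  })
// lines: RDD[(String, String)] -- downstream transformations now work line by line in parallel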

Note also that in the while condition, {entry = zis.getNextEntry(); entry} != null could be replaced by (entry = zis.getNextEntry()) != null in Java. In Scala, however, the result of an assignment is Unit, so that would yield an infinite loop.
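A minimal illustration of that pitfall (entry and zis as in the code above):

// Fine in Java, but broken in Scala: the assignment evaluates to Unit,
// and Unit != null is always true, so the loop would never terminate.
// while ((entry = zis.getNextEntry()) != null) { ... }

// Scala-safe form: the block returns its last expression (the entry itself),
// so the null check is applied to the ZipEntry.
while ({entry = zis.getNextEntry(); entry} != null) {
  // process entry
}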

You can use the code below. The only thing is that we need to collect with zipFilesRdd.collect().forEach before writing the contents into HDFS; map and flatMap give a "Task not serializable" error at this point.

public void readWriteZipContents(String zipLoc, String hdfsBasePath) {
    JavaSparkContext jsc = new JavaSparkContext(new SparkContext(new SparkConf()));
    JavaPairRDD<String, PortableDataStream> zipFilesRdd = jsc.binaryFiles(zipLoc);
    zipFilesRdd.collect().forEach(file -> {
        ZipInputStream zipStream = new ZipInputStream(file._2.open());
        ZipEntry zipEntry = null;
        try {
            while ((zipEntry = zipStream.getNextEntry()) != null) {
                String entryName = zipEntry.getName();
                if (!zipEntry.isDirectory()) {
                    // use a fresh Scanner per entry: ZipInputStream signals end-of-stream at the
                    // end of each entry, so a single Scanner would stop after the first one
                    Scanner sc = new Scanner(zipStream);
                    // create the path in hdfs and write the entry's contents
                    Configuration configuration = new Configuration();
                    configuration.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
                    configuration.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
                    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"), configuration);
                    FSDataOutputStream hdfsfile = fs.create(new Path(hdfsBasePath + "/" + entryName));
                    while (sc.hasNextLine()) {
                        // Scanner strips the line terminator, so add it back
                        hdfsfile.writeBytes(sc.nextLine() + "\n");
                    }
                    hdfsfile.flush();
                    hdfsfile.close();
                }
                zipStream.closeEntry();
            }
            zipStream.close();
        } catch (IllegalArgumentException | IOException e) {
            e.printStackTrace();
        }
    });
}

I came up with this solution, written in Scala.

Tested with Spark 2 (version 2.3.0.cloudera2) and Scala (version 2.11.8).

import java.util.zip.{ZipEntry, ZipException, ZipInputStream}

import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path}
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.SparkSession

def extractHdfsZipFile(source_zip : String, target_folder : String,
    sparksession : SparkSession) : Boolean = {

    val hdfs_config = sparksession.sparkContext.hadoopConfiguration
    val buffer = new Array[Byte](1024)

    /*
     .collect -> runs on the driver only; the hdfs Configuration is not serializable,
     so the extraction cannot be shipped to the executors
    */
    sparksession.sparkContext.binaryFiles(source_zip).collect.
      foreach{ zip_file: (String, PortableDataStream) =>
        // iterate over the zip files
        val zip_stream : ZipInputStream = new ZipInputStream(zip_file._2.open)
        var zip_entry: ZipEntry = null

        try {
          // iterate over all ZipEntry from ZipInputStream
          while ({zip_entry = zip_stream.getNextEntry; zip_entry != null}) {
            // skip directories
            if (!zip_entry.isDirectory()) {
              println(s"Extract File: ${zip_entry.getName()}, with Size: ${zip_entry.getSize()}")
              // create a new hdfs file
              val fs : FileSystem = FileSystem.get(hdfs_config)
              val hdfs_file : FSDataOutputStream = fs.create(new Path(target_folder + "/" + zip_entry.getName()))

              var len : Int = 0
              // copy until the end of the current entry
              while({len = zip_stream.read(buffer); len > 0}) {
                hdfs_file.write(buffer, 0, len)
              }
              // flush and close hdfs_file
              hdfs_file.flush()
              hdfs_file.close()
            }
            zip_stream.closeEntry()
          }
          zip_stream.close()
        } catch {
          case zip : ZipException => {
            zip.printStackTrace()
            println("Please verify that you do not use compresstype9.")
            // for DEBUG throw exception
            //false
            throw zip
          }
          case e : Exception => {
            e.printStackTrace()
            // for DEBUG throw exception
            //false
            throw e
          }
        }
    }
    true
  }
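A minimal usage sketch, assuming an existing SparkSession named spark and hypothetical HDFS paths:

// Hypothetical paths, for illustration only.
val ok = extractHdfsZipFile(
  "hdfs:///data/archives/*.zip",  // source_zip: zip file(s) on HDFS
  "hdfs:///data/extracted",       // target_folder: where the entries are written
  spark)                          // sparksession: an existing SparkSession
println(s"Extraction finished: $ok")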
