How to unzip files stored in HDFS using Spark Java

Question

List<String> list = jsc.wholeTextFiles(hdfsPath).keys().collect();
for (String string : list) {
    System.out.println(string);
}

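This gives me all the zip files. From here I can't work out how to extract each file and store it in an HDFS path, under a folder with the same name as the zip.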

Answer 1

With gzip files, wholeTextFiles should decompress everything automatically. For zip files, however, the only way I know of is to use binaryFiles and decompress the data by hand.

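import java.util.Scanner
import java.util.zip.{ZipEntry, ZipInputStream}

// sc is the SparkContext, hdfsDir the directory that holds the zip files
sc
    .binaryFiles(hdfsDir)
    .mapValues { x =>
        val result = scala.collection.mutable.ArrayBuffer.empty[String]
        val zis = new ZipInputStream(x.open())
        var entry : ZipEntry = null
        while ({entry = zis.getNextEntry(); entry} != null) {
            // a fresh Scanner per entry: the stream reports EOF at the end of each entry
            val scanner = new Scanner(zis)
            while (scanner.hasNextLine()) { result += scanner.nextLine() }
        }
        zis.close()
        result
    }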

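This gives you a pair RDD of (String, ArrayBuffer[String]), where the key is the name of the file on HDFS and the value is the uncompressed content of the zip file (one line per element of the ArrayBuffer). If a given zip file contains more than one file, the lines of all of them end up in the same buffer. You can adapt the code to your exact needs. For example, using flatMapValues instead of mapValues flattens everything into an RDD of (String, String), which lets you take advantage of Spark's parallelism.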

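Also note the while condition: in Java, `{entry = zis.getNextEntry(); entry}` could simply be written as `(entry = zis.getNextEntry())`. In Scala, however, an assignment evaluates to Unit, so comparing it with null is always true and the loop never ends.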
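Since the question asks for Java, the per-archive logic of this answer can be sketched with nothing but java.util.zip. This is a minimal sketch, not a tested Spark job: the class name `ZipLines` and the method `readLines` are made up for illustration, and UTF-8 text content is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Spark or Hadoop.
public class ZipLines {

    /** Reads every line of every file entry in the zip stream, in archive order. */
    public static List<String> readLines(InputStream in) throws IOException {
        List<String> result = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry entry;
            // In Java the assignment is an expression, so it can be compared with null directly.
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // A new Scanner per entry: ZipInputStream signals EOF at the end of
                // each entry, and a Scanner that has seen EOF reads nothing more.
                // Assumption: the entries are UTF-8 text.
                Scanner scanner = new Scanner(zis, StandardCharsets.UTF_8.name());
                while (scanner.hasNextLine()) {
                    result.add(scanner.nextLine());
                }
                // The Scanner is deliberately not closed: closing it would close zis.
            }
        }
        return result;
    }
}
```

With a JavaSparkContext, this would presumably be called as `jsc.binaryFiles(hdfsDir).mapValues(pds -> ZipLines.readLines(pds.open()))`, mirroring the Scala mapValues call above.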

Answer 2

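You can do it as shown below. Note that the zip files have to be collected with zipFilesRdd.collect().forEach before their contents are written to HDFS; with map or flatMap the task fails as not serializable at this point.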

import java.io.IOException;
import java.net.URI;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;

public void readWriteZipContents(String zipLoc, String hdfsBasePath) {
    JavaSparkContext jsc = new JavaSparkContext(new SparkContext(new SparkConf()));
    JavaPairRDD<String, PortableDataStream> zipFilesRdd = jsc.binaryFiles(zipLoc);
    // collect() brings the (path, stream) pairs to the driver, so the loop below runs there
    zipFilesRdd.collect().forEach(file -> {
        Configuration configuration = new Configuration();
        configuration.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        configuration.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
        byte[] buffer = new byte[4096];
        try (ZipInputStream zipStream = new ZipInputStream(file._2.open())) {
            FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"), configuration);
            ZipEntry zipEntry;
            while ((zipEntry = zipStream.getNextEntry()) != null) {
                String entryName = zipEntry.getName();
                if (!zipEntry.isDirectory()) {
                    // create the path in hdfs and copy the entry's bytes into it;
                    // a byte copy keeps line breaks and also works for binary entries
                    try (FSDataOutputStream hdfsfile = fs.create(new Path(hdfsBasePath + "/" + entryName))) {
                        int len;
                        while ((len = zipStream.read(buffer)) > 0) {
                            hdfsfile.write(buffer, 0, len);
                        }
                    }
                }
                zipStream.closeEntry();
            }
        } catch (IllegalArgumentException | IOException e) {
            e.printStackTrace();
        }
    });
}
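The question also asks for each archive to be extracted into a folder named after the zip, which the code above does not do: every entry lands directly under hdfsBasePath. A minimal sketch of how the target path could be derived follows; `ZipTargetPaths`, `zipFolderName` and `targetPath` are hypothetical names introduced here, and the sketch assumes paths use `/` as separator.

```java
// Hypothetical helper, not part of Spark or Hadoop.
public class ZipTargetPaths {

    /** Turns e.g. "hdfs://localhost:8020/in/data.zip" into "data". */
    public static String zipFolderName(String zipPath) {
        String fileName = zipPath.substring(zipPath.lastIndexOf('/') + 1);
        // Case-insensitive check for a trailing ".zip"; false if the name is shorter than 4 chars.
        boolean hasZipSuffix = fileName.regionMatches(true, fileName.length() - 4, ".zip", 0, 4);
        return hasZipSuffix ? fileName.substring(0, fileName.length() - 4) : fileName;
    }

    /** Builds "<base>/<zip name without extension>/<entry name>". */
    public static String targetPath(String hdfsBasePath, String zipPath, String entryName) {
        String base = hdfsBasePath.endsWith("/")
                ? hdfsBasePath.substring(0, hdfsBasePath.length() - 1)
                : hdfsBasePath;
        return base + "/" + zipFolderName(zipPath) + "/" + entryName;
    }
}
```

In the loop above, the key of each pair returned by binaryFiles is the path of the zip file, so the output path could be built as `new Path(ZipTargetPaths.targetPath(hdfsBasePath, file._1, entryName))`.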

Answer 3

I came up with this solution, written in Scala.

Tested with Spark 2 (version 2.3.0.cloudera2) and Scala (version 2.11.8).

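import java.util.zip.{ZipEntry, ZipException, ZipInputStream}
import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path}
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.SparkSession

def extractHdfsZipFile(source_zip : String, target_folder : String,
    sparksession : SparkSession) : Boolean = {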

    val hdfs_config = sparksession.sparkContext.hadoopConfiguration
    val buffer = new Array[Byte](1024)

    /*
     .collect -> run on driver only, not able to serialize hdfs Configuration
    */
    sparksession.sparkContext.binaryFiles(source_zip).collect.
      foreach{ zip_file: (String, PortableDataStream) =>
        // iterate over zip_files
        val zip_stream : ZipInputStream = new ZipInputStream(zip_file._2.open)
        var zip_entry: ZipEntry = null

        try {
          // iterate over all ZipEntry from ZipInputStream
          while ({zip_entry = zip_stream.getNextEntry; zip_entry != null}) {
            // skip directory
            if (!zip_entry.isDirectory()) {
              println(s"Extract File: ${zip_entry.getName()}, with Size: ${zip_entry.getSize()}")
              // create new hdfs file
              val fs : FileSystem = FileSystem.get(hdfs_config)
              val hdfs_file : FSDataOutputStream = fs.create(new Path(target_folder + "/" + zip_entry.getName()))

              var len : Int = 0
              // copy until the end of the current zip entry is reached
              while({len = zip_stream.read(buffer); len > 0}) {
                hdfs_file.write(buffer, 0, len)
              }
              // close hdfs_file; close() also flushes any buffered data
              hdfs_file.close()
            }
            zip_stream.closeEntry()
          }
          zip_stream.close()
        } catch {
          case zip : ZipException => {
            zip.printStackTrace()
            println("Please verify that you do not use compression type 9.")
            // for DEBUG throw exception
            //false
            throw zip
          }
          case e : Exception => {
            e.printStackTrace()
            // for DEBUG throw exception
            //false
            throw e
          }
        }
    }
    true
  }

3 answers

Answer 1: score 2, 2017-12-08 15:45:06
Answer 2: score 2, accepted, 2017-12-09 01:08:41
Answer 3: score 0, 2018-08-09 11:52:24
