
How to write/create zip files on HDFS using Spark/Scala?

I need to write a Spark/Scala function in Apache Zeppelin that simply puts some files that are already present in an HDFS folder into a zip or gzip archive (or some common archive format that is easy to extract in Windows) in the same folder. How would I do this, please? Would it be a Java call? I see there's something called ZipOutputStream, is that the right approach? Any tips appreciated.

Thanks

Spark does not support reading/writing zip files directly, so using ZipOutputStream is basically the only approach.
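
For a single file where the archive itself should live on HDFS, the same ZipOutputStream can be wrapped around the stream returned by FileSystem.create instead of a local FileOutputStream. A minimal sketch (the paths below are placeholders, not from the original question):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import java.util.zip.{ZipEntry, ZipOutputStream}

val conf = new Configuration()
val input = new Path("/data/unzipped/part-00001")     // existing file on HDFS (placeholder)
val output = new Path("/data/zipped/part-00001.zip")  // zip to create on HDFS (placeholder)
val fs = input.getFileSystem(conf)

val in = fs.open(input)
// fs.create returns an HDFS output stream, so the archive is written directly to HDFS
val zipOut = new ZipOutputStream(fs.create(output))
zipOut.putNextEntry(new ZipEntry(input.getName))
// copy the file contents into the zip entry; do not close the streams here
IOUtils.copyBytes(in, zipOut, conf, false)
zipOut.closeEntry()
zipOut.close()
in.close()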

Here's the code I used to compress my existing data via Spark. It recursively lists the files in a directory and then compresses each of them. This code does not preserve the directory structure, but it keeps the file names.

Input directory:

unzipped/
├── part-00001
├── part-00002
└── part-00003

0 directories, 3 files

Output directory:

zipped/
├── part-00001.zip
├── part-00002.zip
└── part-00003.zip

0 directories, 3 files

ZipPacker.scala:

package com.haodemon.spark.compression

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

import java.io.FileOutputStream
import java.util.zip.{ZipEntry, ZipOutputStream}


object ZipPacker extends Serializable {

  private def getSparkContext: SparkContext = {
    val conf: SparkConf = new SparkConf()
      .setAppName("local")
      .setMaster("local[*]")
    SparkSession.builder().config(conf).getOrCreate().sparkContext
  }

  // recursively list files in a filesystem
  private def listFiles(fs: FileSystem, path: Path): List[Path] = {
    fs.listStatus(path).flatMap(p =>
        if (p.isDirectory) listFiles(fs, p.getPath)
        else List(p.getPath)
      ).toList
  }

  // compress each file into its own zip archive; invoked in parallel from the RDD
  private def zip(inputPath: Path, outputDirectory: Path): Unit = {
    val outputPath = {
      val name = inputPath.getName + ".zip"
      outputDirectory + "/" + name
    }
    println(s"Zipping to $outputPath")

    val zipStream = {
      // note: FileOutputStream writes to the local filesystem of the executor, not to HDFS
      val out = new FileOutputStream(outputPath)
      val zip = new ZipOutputStream(out)
      val entry = new ZipEntry(inputPath.getName)
      zip.putNextEntry(entry)
      // max compression
      zip.setLevel(9)
      zip
    }

    val conf = new Configuration
    val uncompressedStream = inputPath.getFileSystem(conf).open(inputPath)
    // copy the input into the zip entry; the flag closes both streams,
    // and closing the ZipOutputStream finalizes the archive
    val close = true
    IOUtils.copyBytes(uncompressedStream, zipStream, conf, close)
  }

  def main(args: Array[String]): Unit = {
    val input = new Path(args(0))
    println(s"Using input path $input")

    val sc = getSparkContext
    val uncompressedFiles = {
      val conf = sc.hadoopConfiguration
      val fs = input.getFileSystem(conf)
      listFiles(fs, input)
    }
    val rdd = sc.parallelize(uncompressedFiles)

    val output = new Path(args(1))
    println(s"Using output path $output")

    rdd.foreach(unzipped => zip(unzipped, output))
  }
}
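
Since the code hardcodes local[*] as the master, it can be packaged into a jar and run with spark-submit, for example (the jar name and paths are placeholders):

spark-submit \
  --class com.haodemon.spark.compression.ZipPacker \
  zip-packer.jar \
  /data/unzipped /data/zipped

The first argument is the input directory and the second is the output directory; the output directory must already exist, since FileOutputStream will not create missing directories.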
