How to write/create zip files on HDFS using Spark/Scala?
I need to write a Spark/Scala function in Apache Zeppelin that simply puts some files that are already present in an HDFS folder into a zip or gzip archive (or some other common archive format that is easy to extract on Windows) in the same folder. How would I do this, please? Would it be a Java call? I see there's something called ZipOutputStream; is that the right approach? Any tips appreciated.

Thanks
Spark does not support reading/writing zip files directly, so using ZipOutputStream is basically the only approach.

Here's the code I used to compress my existing data via Spark. It recursively lists the directory for files and then compresses them. This code does not preserve the directory structure, but it keeps the file names.
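Before the full job, here is a minimal round-trip sketch of the `java.util.zip` calls it relies on, done entirely in memory with no HDFS involved (the entry name and content are just placeholders):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

object ZipRoundTrip {
  def main(args: Array[String]): Unit = {
    // write a single entry at max compression
    val zipped: Array[Byte] = {
      val buf = new ByteArrayOutputStream()
      val zip = new ZipOutputStream(buf)
      zip.setLevel(9)
      zip.putNextEntry(new ZipEntry("part-00001"))
      zip.write("hello from hdfs".getBytes("UTF-8"))
      zip.closeEntry()
      zip.close() // writes the zip central directory
      buf.toByteArray
    }

    // read the entry back to confirm the archive is well-formed
    val in = new ZipInputStream(new ByteArrayInputStream(zipped))
    val entry = in.getNextEntry
    val content = scala.io.Source.fromInputStream(in, "UTF-8").mkString
    in.close()

    println(entry.getName) // part-00001
    println(content)       // hello from hdfs
  }
}
```

The same write-side sequence (`putNextEntry`, write bytes, close) is what the job below performs per file, just with Hadoop streams instead of byte arrays.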
Input directory:
unzipped/
├── part-00001
├── part-00002
└── part-00003
0 directories, 3 files
Output directory:
zipped/
├── part-00001.zip
├── part-00002.zip
└── part-00003.zip
0 directories, 3 files
ZipPacker.scala:
package com.haodemon.spark.compression

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}
import java.util.zip.{ZipEntry, ZipOutputStream}

object ZipPacker extends Serializable {

  private def getSparkContext: SparkContext = {
    val conf: SparkConf = new SparkConf()
      .setAppName("local")
      .setMaster("local[*]")
    SparkSession.builder().config(conf).getOrCreate().sparkContext
  }

  // recursively list all files under path
  private def listFiles(fs: FileSystem, path: Path): List[Path] = {
    fs.listStatus(path).flatMap { p =>
      if (p.isDirectory) listFiles(fs, p.getPath)
      else List(p.getPath)
    }.toList
  }

  // zip-compress files one by one in parallel
  private def zip(inputPath: Path, outputDirectory: Path): Unit = {
    val outputPath = new Path(outputDirectory, inputPath.getName + ".zip")
    println(s"Zipping to $outputPath")
    val conf = new Configuration
    val zipStream = {
      // write through the Hadoop FileSystem so the archive lands on HDFS,
      // not on the worker's local disk as a FileOutputStream would
      val out = outputPath.getFileSystem(conf).create(outputPath)
      val zip = new ZipOutputStream(out)
      zip.setLevel(9) // max compression
      zip.putNextEntry(new ZipEntry(inputPath.getName))
      zip
    }
    val uncompressedStream = inputPath.getFileSystem(conf).open(inputPath)
    // close = true closes both streams; closing the ZipOutputStream
    // finishes the entry and writes the zip central directory
    val close = true
    IOUtils.copyBytes(uncompressedStream, zipStream, conf, close)
  }

  def main(args: Array[String]): Unit = {
    val input = new Path(args(0))
    println(s"Using input path $input")
    val sc = getSparkContext
    val uncompressedFiles = {
      val fs = input.getFileSystem(sc.hadoopConfiguration)
      listFiles(fs, input)
    }
    val rdd = sc.parallelize(uncompressedFiles)
    val output = new Path(args(1))
    println(s"Using output path $output")
    rdd.foreach(unzipped => zip(unzipped, output))
  }
}
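The recursive traversal that `listFiles` performs over HDFS can be mirrored locally with `java.nio.file` to see the logic in isolation. A small sketch against a throwaway temp directory (the file names here are illustrative, chosen to match the example layout above):

```scala
import java.nio.file.{Files, Path}
import scala.collection.mutable.ListBuffer

object ListFilesDemo {
  // local analogue of listFiles: recurse into directories, collect plain files
  def listFilesLocal(dir: Path): List[Path] = {
    val out = ListBuffer.empty[Path]
    val stream = Files.newDirectoryStream(dir)
    try {
      val it = stream.iterator()
      while (it.hasNext) {
        val p = it.next()
        if (Files.isDirectory(p)) out ++= listFilesLocal(p)
        else out += p
      }
    } finally stream.close()
    out.toList
  }

  def main(args: Array[String]): Unit = {
    val root = Files.createTempDirectory("unzipped")
    val sub  = Files.createDirectory(root.resolve("nested"))
    Files.createFile(root.resolve("part-00001"))
    Files.createFile(sub.resolve("part-00002"))

    val found = listFilesLocal(root).map(_.getFileName.toString).sorted
    println(found) // List(part-00001, part-00002)
  }
}
```

Note that, like the HDFS version, this flattens the tree: the nested file is returned alongside the top-level one, which is why the job keeps file names but not directory structure.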