
Spark load Z compressed file using Scala on Databricks

Is there a way to read a .Z (capital) file extension with Spark directly?

I know Spark with Scala can read gzip files (.gz) directly, but when I try to load a compressed Z file (.Z) into a DataFrame it doesn't work.

The reason you can't read a .Z file is that Spark tries to match the file extension against the registered compression codecs, and no registered codec handles the .Z extension.

All you have to do is extend GzipCodec and override the getDefaultExtension method.

As an example:

Here is our ZgzipCodec.scala

package codecs

import org.apache.hadoop.io.compress.GzipCodec

// A codec identical to GzipCodec except that it claims the .Z extension,
// so Hadoop/Spark will decompress .Z files as ordinary gzip data.
class ZgzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".Z"
}
And here is a main application that registers the codec through the spark.hadoop.io.compression.codecs setting:

package tests

import org.apache.spark.sql.SparkSession

object ReadingGzipFromZExtension{
  val spark = SparkSession
    .builder()
    .appName("ReadingGzipFromZExtension")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "4") //Change to a more reasonable default number of partitions for our data
    .config("spark.app.id", "ReadingGzipFromZExtension")  // To silence Metrics warning
    .config("spark.hadoop.io.compression.codecs", "codecs.ZgzipCodec") // Custom codec that processes the .Z extension as the common gzip format
    .getOrCreate()

  val sc = spark.sparkContext

  def main(args: Array[String]): Unit = {

    val data = spark.read.csv("/path/file.Z")
    data.show()

    sc.stop()
    spark.stop()
  }
}
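To try this end to end, you can fabricate a test file by gzip-compressing a small CSV and renaming it with the .Z extension (the filename below is just an example). Note one caveat: this whole approach assumes your .Z files actually contain gzip data; files produced by the classic Unix compress(1) tool use LZW, which GzipCodec cannot decode.

```shell
# Create a small CSV, gzip it, then rename the result to .Z
# (the bytes stay gzip; only the extension changes, which is
# exactly the mismatch the custom codec papers over)
echo "a,b,c" > sample.csv
gzip sample.csv               # produces sample.csv.gz
mv sample.csv.gz sample.csv.Z

# gunzip detects the format from the magic bytes, not the extension,
# so this still prints the original row:
gunzip -c sample.csv.Z
```

With ZgzipCodec registered, spark.read.csv("sample.csv.Z") should then show the single row a,b,c.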

You can follow this link for further details: Reading compressed data with Spark using unknown file extensions
