
Spark load Z compressed file using Scala on Databricks

Is there a way to read a .Z (capital) file extension with Spark directly?

I know Spark with Scala can read gzip files (.gz) directly, but when I try to load a compressed Z file (.Z) into a DataFrame it doesn't work.

The reason you can't read a .Z file is that Spark tries to match the file extension against the registered compression codecs, and no registered codec handles the .Z extension.

All you have to do is extend GzipCodec and override the getDefaultExtension method.

As an example:

Here is our ZgzipCodec.scala

package codecs

import org.apache.hadoop.io.compress.GzipCodec

// A codec identical to GzipCodec except that it claims the .Z extension,
// so Hadoop/Spark will decompress .Z files as ordinary gzip data.
class ZgzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".Z"
}
And here is a main application that registers the codec through the spark.hadoop.io.compression.codecs setting:

package tests

import org.apache.spark.sql.SparkSession

object ReadingGzipFromZExtension{
  val spark = SparkSession
    .builder()
    .appName("ReadingGzipFromZExtension")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "4") //Change to a more reasonable default number of partitions for our data
    .config("spark.app.id", "ReadingGzipFromZExtension")  // To silence Metrics warning
    .config("spark.hadoop.io.compression.codecs", "codecs.ZgzipCodec") // Custom codec that processes the .Z extension as the common gzip format
    .getOrCreate()

  val sc = spark.sparkContext

  def main(args: Array[String]): Unit = {

    val data = spark.read.csv("/path/file.Z")
    data.show()

    sc.stop()
    spark.stop()
  }
}
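To try this end to end, you can fabricate a test file by gzip-compressing a small CSV and renaming it with the .Z extension (the filename below is just an example). Note one caveat: this whole approach assumes your .Z files actually contain gzip data; files produced by the classic Unix compress(1) tool use LZW, which GzipCodec cannot decode.

```shell
# Create a small CSV, gzip it, then rename the result to .Z
# (the bytes stay gzip; only the extension changes, which is
# exactly the mismatch the custom codec papers over)
echo "a,b,c" > sample.csv
gzip sample.csv               # produces sample.csv.gz
mv sample.csv.gz sample.csv.Z

# gunzip detects the format from the magic bytes, not the extension,
# so this still prints the original row:
gunzip -c sample.csv.Z
```

With ZgzipCodec registered, spark.read.csv("sample.csv.Z") should then show the single row a,b,c.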

You can follow this link for further details: Reading compressed data with Spark using unknown file extensions
