Is there a way to read a file with a .Z (capital) extension directly with Spark?
I know Spark with Scala can read gzip files (.gz) directly, but when I try to load a .Z-compressed file into a DataFrame it doesn't work.
The reason you can't read a .Z file is that Spark tries to match the file extension against the registered compression codecs, and no codec handles the .Z extension.
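You can see this extension-based lookup directly through Hadoop's CompressionCodecFactory, which is what sits underneath Spark's file readers. The following is a minimal sketch (assuming hadoop-common on the classpath; the object name CodecLookup is just for illustration):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.CompressionCodecFactory

object CodecLookup {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val factory = new CompressionCodecFactory(conf)

    // getCodec resolves a codec purely from the file name's extension.
    println(factory.getCodec(new Path("/data/file.gz"))) // GzipCodec: .gz is registered by default
    println(factory.getCodec(new Path("/data/file.Z")))  // null: no default codec claims .Z
  }
}
```

Because the lookup keys only on the extension, teaching Spark about .Z is just a matter of registering a codec that claims it.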
All you have to do is extend GzipCodec and override its getDefaultExtension method.
As an example, here is our ZgzipCodec.scala:

package codecs

import org.apache.hadoop.io.compress.GzipCodec

// A GzipCodec that claims the .Z extension, so files ending in .Z
// are decompressed with the gzip algorithm.
class ZgzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".Z"
}
And here is a small application that registers the codec and reads a .Z file:

package tests

import org.apache.spark.sql.SparkSession

object ReadingGzipFromZExtension {

  val spark = SparkSession
    .builder()
    .appName("ReadingGzipFromZExtension")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "4")         // A more reasonable default number of partitions for our data
    .config("spark.app.id", "ReadingGzipFromZExtension") // To silence the Metrics warning
    .config("spark.hadoop.io.compression.codecs", "codecs.ZgzipCodec") // Custom codec that treats the .Z extension as plain gzip
    .getOrCreate()

  val sc = spark.sparkContext

  def main(args: Array[String]): Unit = {
    val data = spark.read.csv("/path/file.Z")
    data.show()

    sc.stop()
    spark.stop()
  }
}
You can follow this link for further details: Reading compressed data with Spark using unknown file extensions