I am reading an ORC file with the following data
My column c1 should be an int and c2 should be a string, but Spark is interpreting c2 as a decimal. I tried the following code to work around it:
spark.read.option("inferSchema","false").option("header", "true").orc("path to file")
But the Spark ORC reader still reads the data with the file's schema, even though I tried to turn off schema inference. Is there a way to make Spark not read the schema, so that I can apply my custom schema after the read?
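One way to picture what is happening: ORC is a self-describing format, so the column types come from the file itself, and `inferSchema` and `header` are CSV reader options that the ORC reader does not use. A minimal sketch, assuming a local SparkSession and a hypothetical temporary path (neither is from the original post), that reads the file with its stored schema and then casts the column to the desired type:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StringType}

// Assumption: a local session, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("orc-cast-sketch").getOrCreate()
import spark.implicits._

// Hypothetical temporary location standing in for the real file.
val demoPath = Files.createTempDirectory("orc-demo").resolve("demo.orc").toString

// Write a file whose C2 column is physically stored as a double.
Seq((1, 1954e7)).toDF("C1", "C2").write.mode("overwrite").orc(demoPath)

// The reader reports the types recorded in the ORC file.
val raw = spark.read.orc(demoPath)

// Apply the desired type after the read with a built-in cast.
val casted = raw.withColumn("C2", col("C2").cast(StringType))
casted.printSchema()
```

The cast changes only the column type. It renders the stored number as text and cannot bring back the characters of a value that was already written as a number.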
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.SparkSession
// spark: SparkSession
import spark.implicits._
When C2 is a string:
val pathORC = "<path>/source.orc"
case class O(C1: Int, C2: String)
val source = Seq(O(1, "1954E7")).toDF()
source.printSchema()
// root
// |-- C1: integer (nullable = false)
// |-- C2: string (nullable = true)
source.show(false)
// +---+------+
// |C1 |C2    |
// +---+------+
// |1  |1954E7|
// +---+------+
source.write.mode("overwrite").orc(pathORC)
val res = spark.read.orc(pathORC)
res.printSchema()
// root
// |-- C1: integer (nullable = true)
// |-- C2: string (nullable = true)
res.show(false)
// +---+------+
// |C1 |C2    |
// +---+------+
// |1  |1954E7|
// +---+------+
When C2 is a double:
val pathORC1 = "<path>/source1.orc"
val source1 = Seq((1, 1954e7)).toDF("C1", "C2")
source1.printSchema()
// root
// |-- C1: integer (nullable = false)
// |-- C2: double (nullable = false)
source1.show(false)
// +---+--------+
// |C1 |C2      |
// +---+--------+
// |1  |1.954E10|
// +---+--------+
source1.write.mode("overwrite").orc(pathORC1)
val res1 = spark.read.orc(pathORC1)
res1.printSchema()
// root
// |-- C1: integer (nullable = true)
// |-- C2: double (nullable = true)
res1.show(false)
// +---+--------+
// |C1 |C2      |
// +---+--------+
// |1  |1.954E10|
// +---+--------+
// Convert the double to its string form and drop the decimal point ("1.954E10" -> "1954E10").
val dToStr = udf((v: Double) => v.toString.replace(".", ""))
val res2 = res1
  .withColumn("C2", dToStr(col("C2")))
res2.printSchema()
// root
// |-- C1: integer (nullable = true)
// |-- C2: string (nullable = true)
res2.show(false)
// +---+-------+
// |C1 |C2     |
// +---+-------+
// |1  |1954E10|
// +---+-------+
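Note that the result is `1954E10`, not the `1954E7` from the first case. A small plain-Scala sketch (no Spark needed; the value names are illustrative) shows why: once the text has been stored as a double, only the number survives, and its string form uses scientific notation.

```scala
// The literal text from the string case, parsed as a number.
val parsed: Double = "1954E7".toDouble

// Double.toString uses scientific notation for a value this large.
val asText = parsed.toString                       // "1.954E10"

// What the UDF does: drop the decimal point.
val noDot = asText.replace(".", "")                // "1954E10"

// Plain decimal digits, if the numeric value is what you want.
val plain = BigDecimal(parsed).toBigInt.toString   // "19540000000"
```

If the original text matters, the column most likely has to be written as a string in the first place, as in the first case.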