Spark Dataframe - Encoder
I am new to Scala and Spark.
I am trying to use an encoder to read a file with Spark and then convert the rows to a Java/Scala object.
The first step, reading the file with a schema applied and encoding it using as, works fine.
Then I use that dataset/dataframe to do a simple map operation, but if I try to print the schema of the resulting dataset/dataframe, it doesn't print any columns.
Also, when I first read the file, I don't map the age field in the Person class, because I want to calculate it in the map function as an experiment. However, I don't see age in the data frame produced with Person at all.
Data in Person.txt:
firstName,lastName,dob
ABC, XYZ, 01/01/2019
CDE, FGH, 01/02/2020
Here is the code:
import java.time.Year
import org.apache.spark.sql.{Encoders, SparkSession}

object EncoderExample extends App {
  val sparkSession = SparkSession.builder().appName("EncoderExample").master("local").getOrCreate()
  case class Person(firstName: String, lastName: String, dob: String, var age: Int = 10)
  implicit val encoder = Encoders.bean[Person](classOf[Person])
  val personDf = sparkSession.read.option("header", "true").option("inferSchema", "true").csv("Person.txt").as(encoder)
  personDf.printSchema()
  personDf.show()
  val calAge = personDf.map(p => {
    p.age = Year.now().getValue - p.dob.substring(6).toInt
    println(p.age)
    p
  }) //.toDF()//.as(encoder)
  print("*********Person DF Schema after age calculation: ")
  calAge.printSchema()
  //calAge.show
}
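A side note on the age calculation in the map above: the sample rows in Person.txt have a space after each comma, and Spark's CSV reader keeps leading whitespace by default (ignoreLeadingWhiteSpace is false), so dob would likely arrive as " 01/01/2019" and substring(6) would not yield a clean year. Below is a minimal sketch of a more tolerant helper that can be tried without Spark. It assumes the dates follow a dd/MM/yyyy pattern (the sample data does not say whether day or month comes first), and AgeSketch and yearsSince are hypothetical names used only for illustration.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object AgeSketch {
  // Assumed pattern; only the year is used, so the day/month order
  // does not affect the result for the sample rows.
  private val dobFormat = DateTimeFormatter.ofPattern("dd/MM/yyyy")

  // Difference between a reference year and the year of birth.
  // The input is trimmed first so values like " 01/01/2019" still parse.
  def yearsSince(dob: String, referenceYear: Int): Int =
    referenceYear - LocalDate.parse(dob.trim, dobFormat).getYear
}
```

For example, AgeSketch.yearsSince(" 01/01/2019", 2020) evaluates to 1. Passing the reference year as a parameter, rather than calling Year.now() inside, keeps the helper easy to check.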
package spark

import java.text.SimpleDateFormat
import java.util.Calendar

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

case class Person(firstName: String, lastName: String, dob: String, age: Long)

object CalcAge extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  val sourceDF = Seq(
    ("ABC", "XYZ", "01/01/2019"),
    ("CDE", "FGH", "01/02/2020")
  ).toDF("firstName", "lastName", "dob")

  sourceDF.printSchema
  // root
  //  |-- firstName: string (nullable = true)
  //  |-- lastName: string (nullable = true)
  //  |-- dob: string (nullable = true)

  sourceDF.show(false)
  // +---------+--------+----------+
  // |firstName|lastName|dob       |
  // +---------+--------+----------+
  // |ABC      |XYZ     |01/01/2019|
  // |CDE      |FGH     |01/02/2020|
  // +---------+--------+----------+

  // Current calendar year as a Long
  def getCurrentYear: Long = {
    val today: java.util.Date = Calendar.getInstance.getTime
    val timeFormat = new SimpleDateFormat("yyyy")
    timeFormat.format(today).toLong
  }

  // UDF: current year minus the year taken from the last "/"-separated part of dob
  val ageUDF = udf((d1: String) => {
    val year = d1.split("/").reverse.head.toLong
    val yearNow = getCurrentYear
    yearNow - year
  })

  val df = sourceDF
    .withColumn("age", ageUDF('dob))

  df.printSchema
  // root
  //  |-- firstName: string (nullable = true)
  //  |-- lastName: string (nullable = true)
  //  |-- dob: string (nullable = true)
  //  |-- age: long (nullable = false)

  // Ages depend on the current year; the values shown are from a run in 2020
  df.show(false)
  // +---------+--------+----------+---+
  // |firstName|lastName|dob       |age|
  // +---------+--------+----------+---+
  // |ABC      |XYZ     |01/01/2019|1  |
  // |CDE      |FGH     |01/02/2020|0  |
  // +---------+--------+----------+---+

  // Convert to a typed Dataset[Person] and collect the rows to the driver
  val person = df.as[Person].collectAsList()
  // person: java.util.List[Person] = [Person(ABC,XYZ,01/01/2019,1), Person(CDE,FGH,01/02/2020,0)]
  println(person)
}
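The map-based approach from the question can probably be kept as well. Encoders.bean expects a JavaBean-style class with getters and setters, which a Scala case class does not provide, and that is a likely reason the schema came out empty after the map. A rough sketch, assuming the case-class encoders from spark.implicits._ are used instead of Encoders.bean; PersonIn, PersonOut, TypedMapSketch and withAge are hypothetical names for illustration, and the year is assumed to be the last four characters of dob:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical row types: the input has no age, the output adds it.
case class PersonIn(firstName: String, lastName: String, dob: String)
case class PersonOut(firstName: String, lastName: String, dob: String, age: Int)

object TypedMapSketch {
  // Builds a new immutable row per input row instead of mutating a var field.
  def withAge(spark: SparkSession, people: Dataset[PersonIn], currentYear: Int): Dataset[PersonOut] = {
    import spark.implicits._ // brings the implicit encoders for case classes into scope
    people.map { p =>
      // Assumption: dob ends with a four-digit year; trim guards against stray spaces.
      val birthYear = p.dob.trim.takeRight(4).toInt
      PersonOut(p.firstName, p.lastName, p.dob, currentYear - birthYear)
    }
  }
}
```

With a Dataset[PersonIn] obtained via .as[PersonIn], calling printSchema() on the result of withAge should list firstName, lastName, dob and age.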