Spark Dataframe - Encoder

I am new to Scala and Spark.

I am trying to use an encoder to read a file with Spark and then convert it to a Java/Scala object.

The first step, reading the file while applying a schema and encoding it with as, works fine.

Then I use that Dataset/DataFrame to do a simple map operation, but if I try to print the schema on the resulting Dataset/DataFrame, it doesn't print any columns.

Also, when I first read the file, I don't map the age field in the Person class; I only calculate it in the map function to try things out. But I don't see age mapped onto the data frame through Person at all.

Data in Person.txt:

firstName,lastName,dob
ABC, XYZ, 01/01/2019
CDE, FGH, 01/02/2020

Below is the code:

import java.time.Year

import org.apache.spark.sql.{Encoders, SparkSession}

object EncoderExample extends App {
  val sparkSession = SparkSession.builder().appName("EncoderExample").master("local").getOrCreate()

  case class Person(firstName: String, lastName: String, dob: String, var age: Int = 10)
  implicit val encoder = Encoders.bean[Person](classOf[Person])
  val personDf = sparkSession.read.option("header", "true").option("inferSchema", "true").csv("Person.txt").as(encoder)

  personDf.printSchema()
  personDf.show()

  val calAge = personDf.map(p => {
    p.age = Year.now().getValue - p.dob.substring(6).toInt
    println(p.age)
    p
  } )//.toDF()//.as(encoder)

  print("*********Person DF Schema after age calculation: ")
  calAge.printSchema()

  //calAge.show
}
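A likely culprit is Encoders.bean[Person]: a bean encoder derives the schema from JavaBean-style getter/setter pairs (getFirstName/setFirstName), which a Scala case class does not expose, so the derived schema comes out empty. The empty schema only surfaces after map, when the encoder is actually applied to produce the new Dataset. For case classes, Spark's product encoders (implicitly via import spark.implicits._, or explicitly via Encoders.product[Person]) are the usual choice. A minimal sketch under that assumption (the EncoderFix object name is illustrative, and the Person.txt file is recreated inline so the example is self-contained):

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.{Dataset, SparkSession}

object EncoderFix extends App {
  // Recreate the question's Person.txt so the sketch runs on its own.
  Files.write(Paths.get("Person.txt"),
    "firstName,lastName,dob\nABC, XYZ, 01/01/2019\nCDE, FGH, 01/02/2020".getBytes)

  val spark = SparkSession.builder().appName("EncoderFix").master("local").getOrCreate()
  import spark.implicits._

  // age is left out of the case class: it is derived later, not read from the file.
  case class Person(firstName: String, lastName: String, dob: String)

  // as[Person] picks up the implicit product encoder that spark.implicits._
  // provides for case classes; Encoders.product[Person] is the explicit form.
  val persons: Dataset[Person] = spark.read
    .option("header", "true")
    .csv("Person.txt")
    .as[Person]

  persons.printSchema() // firstName, lastName and dob all appear
}
```

With a product encoder, the schema survives a subsequent map because the encoder actually knows the case class's fields.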
package spark

import java.text.SimpleDateFormat
import java.util.Calendar

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

case class Person(firstName: String, lastName: String, dob: String, age: Long)

object CalcAge extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  val sourceDF = Seq(
    ("ABC", "XYZ", "01/01/2019"),
    ("CDE", "FGH", "01/02/2020")
  ).toDF("firstName","lastName","dob")

  sourceDF.printSchema
  //  root
  //  |-- firstName: string (nullable = true)
  //  |-- lastName: string (nullable = true)
  //  |-- dob: string (nullable = true)

  sourceDF.show(false)
  //  +---------+--------+----------+
  //  |firstName|lastName|dob       |
  //  +---------+--------+----------+
  //  |ABC      |XYZ     |01/01/2019|
  //  |CDE      |FGH     |01/02/2020|
  //  +---------+--------+----------+


  def getCurrentYear: Long = {
    val today: java.util.Date = Calendar.getInstance.getTime
    val timeFormat = new SimpleDateFormat("yyyy")
    timeFormat.format(today).toLong
  }

  val ageUDF = udf((d1: String) => {

    val year = d1.split("/").reverse.head.toLong
    val yearNow = getCurrentYear
    yearNow - year
  })


  val df = sourceDF
    .withColumn("age", ageUDF('dob))
  df.printSchema
  //  root
  //  |-- firstName: string (nullable = true)
  //  |-- lastName: string (nullable = true)
  //  |-- dob: string (nullable = true)
  //  |-- age: long (nullable = false)

  df.show(false)
  //  +---------+--------+----------+---+
  //  |firstName|lastName|dob       |age|
  //  +---------+--------+----------+---+
  //  |ABC      |XYZ     |01/01/2019|1  |
  //  |CDE      |FGH     |01/02/2020|0  |
  //  +---------+--------+----------+---+

  val person = df.as[Person].collectAsList()
  //  person: java.util.List[Person] = [Person(ABC,XYZ,01/01/2019,1), Person(CDE,FGH,01/02/2020,0)]
  println(person)



}
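As a side note, the answer's getCurrentYear helper can be written more directly with java.time, avoiding SimpleDateFormat and Calendar entirely (a small sketch, not part of the original answer; the CurrentYear wrapper object is illustrative):

```scala
import java.time.Year

object CurrentYear {
  // Same value as getCurrentYear above: the current four-digit year as a Long.
  val currentYear: Long = Year.now().getValue.toLong
}
```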
