Spark Dataframe - Encoder

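I am new to Scala and Spark.

I am trying to use an encoder to read a file with Spark and then convert it to a Java/Scala object.

The first step, reading the file while applying the schema and encoding with as, works fine.

Then I use that Dataset/DataFrame to run a simple map operation, but when I print the schema of the resulting Dataset/DataFrame, it does not print any columns.

Also, when I first read the file, I do not map the age field of the Person class; I only calculate it in the map function as an experiment. However, I do not see age in the DataFrame mapped with Person at all.

Data in Person.txt: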

firstName,lastName,dob
ABC, XYZ, 01/01/2019
CDE, FGH, 01/02/2020

Here is the code:

object EncoderExample extends App {
  val sparkSession = SparkSession.builder().appName("EncoderExample").master("local").getOrCreate();

  case class Person(firstName: String, lastName: String, dob: String,var age: Int = 10)
  implicit val encoder = Encoders.bean[Person](classOf[Person])
  val personDf = sparkSession.read.option("header","true").option("inferSchema","true").csv("Person.txt").as(encoder)

  personDf.printSchema()
  personDf.show()

  val calAge = personDf.map(p => {
    p.age = Year.now().getValue - p.dob.substring(6).toInt
    println(p.age)
    p
  } )//.toDF()//.as(encoder)

  print("*********Person DF Schema after age calculation: ")
  calAge.printSchema()

  //calAge.show
}

Answer:

package spark

import java.text.SimpleDateFormat
import java.util.Calendar

import org.apache.spark.sql.{SparkSession}
import org.apache.spark.sql.functions._

case class Person(firstName: String, lastName: String, dob: String, age: Long)

object CalcAge extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  val sourceDF = Seq(
    ("ABC", "XYZ", "01/01/2019"),
    ("CDE", "FGH", "01/02/2020")
  ).toDF("firstName","lastName","dob")

  sourceDF.printSchema
  //  root
  //  |-- firstName: string (nullable = true)
  //  |-- lastName: string (nullable = true)
  //  |-- dob: string (nullable = true)

  sourceDF.show(false)
  //  +---------+--------+----------+
  //  |firstName|lastName|dob       |
  //  +---------+--------+----------+
  //  |ABC      |XYZ     |01/01/2019|
  //  |CDE      |FGH     |01/02/2020|
  //  +---------+--------+----------+


  def getCurrentYear: Long = {

    val today:java.util.Date = Calendar.getInstance.getTime
    val timeFormat = new SimpleDateFormat("yyyy")
    timeFormat.format(today).toLong

  }

  val ageUDF = udf((d1: String) => {

    val year = d1.split("/").reverse.head.toLong
    val yearNow = getCurrentYear
    yearNow - year
  })


  val df = sourceDF
    .withColumn("age", ageUDF('dob))
  df.printSchema
  //  root
  //  |-- firstName: string (nullable = true)
  //  |-- lastName: string (nullable = true)
  //  |-- dob: string (nullable = true)
  //  |-- age: long (nullable = false)

  df.show(false)
  //  (the ages below come from a run in 2020; they depend on the current year)
  //  +---------+--------+----------+---+
  //  |firstName|lastName|dob       |age|
  //  +---------+--------+----------+---+
  //  |ABC      |XYZ     |01/01/2019|1  |
  //  |CDE      |FGH     |01/02/2020|0  |
  //  +---------+--------+----------+---+

  val person = df.as[Person].collectAsList()
  //  person: java.util.List[Person] = [Person(ABC,XYZ,01/01/2019,1), Person(CDE,FGH,01/02/2020,0)]
  println(person)



}
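
The year arithmetic used in both the question's map function and the answer's UDF can also be written as a small, Spark-independent helper. What follows is only a sketch under stated assumptions: the names DobAge and ageInYears are hypothetical, and the dd/MM/yyyy pattern is a guess based on the two sample rows (change it to MM/dd/yyyy if the data is month-first). It trims the input because the sample rows in Person.txt have a space after each comma, which would otherwise shift the offsets that substring(6) relies on.

```scala
import java.time.{LocalDate, Year}
import java.time.format.DateTimeFormatter
import scala.util.Try

object DobAge {
  // Assumed layout of the dob column, guessed from the sample rows ("01/01/2019").
  private val DobFormat = DateTimeFormatter.ofPattern("dd/MM/yyyy")

  // Difference between `currentYear` and the year found in `dob`.
  // Returns None when `dob` is null or does not match the assumed pattern.
  def ageInYears(dob: String, currentYear: Int = Year.now().getValue): Option[Int] =
    Option(dob)
      .map(_.trim) // tolerate the leading space left by ", " separators
      .flatMap(s => Try(LocalDate.parse(s, DobFormat)).toOption)
      .map(d => currentYear - d.getYear)
}
```

Passing the current year as a parameter keeps the function deterministic, so it can be checked without depending on the date of the run. It could then be called from inside the map function or the UDF in place of the string slicing.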
