
Convert Date Column to Age with Scala and Spark

I am trying to convert a column of a Dataset to an actual age. I am using Scala with Spark, and my project is in IntelliJ.

This is the sample dataset:

TotalCost|BirthDate|Gender|TotalChildren|ProductCategoryName
1000||Male|2|Technology
2000|1957-03-06||3|Beauty
3000|1959-03-06|Male||Car
4000|1953-03-06|Male|2|
5000|1957-03-06|Female|3|Beauty
6000|1959-03-06|Male|4|Car
7000|1957-03-06|Female|3|Beauty
8000|1959-03-06|Male|4|Car 

And this is the Scala code:

import org.apache.spark.sql.SparkSession

object DataFrameFromCSVFile2 {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExample")
      .getOrCreate()

    val filePath = "src/main/resources/demodata.txt"

    // Read the pipe-delimited file with a header row, letting Spark infer column types
    val df = spark.read
      .options(Map("inferSchema" -> "true", "delimiter" -> "|", "header" -> "true"))
      .csv(filePath)
      .select("Gender", "BirthDate", "TotalCost", "TotalChildren", "ProductCategoryName")

    // Keep only rows where none of these columns is null
    val df2 = df
      .filter("Gender is not null")
      .filter("BirthDate is not null")
      .filter("TotalChildren is not null")
      .filter("ProductCategoryName is not null")
    df2.show()
  }
}

So I am trying to convert a value such as 1957-03-06 in the column to an age such as 61.

Any ideas would help a lot.

Thank you very much.

You can use the built-in functions months_between() or datediff(). For example:

scala> val df = Seq("1957-03-06","1959-03-06").toDF("date")
df: org.apache.spark.sql.DataFrame = [date: string]

scala> df.show(false)
+----------+
|date      |
+----------+
|1957-03-06|
|1959-03-06|
+----------+

scala> df.withColumn("age",months_between(current_date,'date)/12).show
+----------+------------------+
|      date|               age|
+----------+------------------+
|1957-03-06|61.806451612500005|
|1959-03-06|59.806451612500005|
+----------+------------------+

scala> df.withColumn("age",datediff(current_date,'date)/365).show
+----------+-----------------+
|      date|              age|
+----------+-----------------+
|1957-03-06|61.85205479452055|
|1959-03-06|59.85205479452055|
+----------+-----------------+
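Both expressions above yield a fractional age. If a whole number of years is wanted (61 rather than 61.8), one option is to floor the months_between() result after dividing by 12. The following is a minimal sketch, not a definitive implementation: the object name AgeColumn, the helper ageInYears, and the fixed reference date 2019-01-01 are assumptions made for illustration, and in practice current_date() would likely replace the fixed date.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, floor, lit, months_between, to_date}

object AgeColumn {
  // Hypothetical helper: whole years elapsed between `birth` and `asOf`.
  // months_between returns fractional months; dividing by 12 and flooring
  // drops the part of a year that has not been completed yet.
  // A null birth date yields a null age.
  def ageInYears(birth: Column, asOf: Column): Column =
    floor(months_between(asOf, birth) / 12).cast("int")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("AgeColumn")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("1957-03-06", "1959-03-06").toDF("BirthDate")

    // Assumption: a fixed reference date keeps the result reproducible.
    val asOf = to_date(lit("2019-01-01"))

    df.withColumn("Age", ageInYears(to_date(col("BirthDate")), asOf)).show()
    spark.stop()
  }
}
```

The to_date() call is there because a column read with inferSchema may come back as a string or timestamp rather than a date.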



Here's one way that uses the java.time API in a UDF, along with Spark's built-in when/otherwise for the null check:

val currentAge = udf{ (dob: java.sql.Date) =>
  import java.time.{LocalDate, Period}
  Period.between(dob.toLocalDate, LocalDate.now).getYears
}

df.withColumn("CurrentAge", when($"BirthDate".isNotNull, currentAge($"BirthDate"))).
  show(5)
// +------+-------------------+---------+-------------+-------------------+----------+
// |Gender|          BirthDate|TotalCost|TotalChildren|ProductCategoryName|CurrentAge|
// +------+-------------------+---------+-------------+-------------------+----------+
// |  Male|               null|     1000|            2|         Technology|      null|
// |  null|1957-03-06 00:00:00|     2000|            3|             Beauty|        61|
// |  Male|1959-03-06 00:00:00|     3000|         null|                Car|        59|
// |  Male|1953-03-06 00:00:00|     4000|            2|               null|        65|
// |Female|1957-03-06 00:00:00|     5000|            3|             Beauty|        61|
// +------+-------------------+---------+-------------+-------------------+----------+

You can use the Java Calendar library to get the current date in your time zone and calculate the age inside a UDF. For example:

import java.time.{Period, ZoneId}
import java.util.Calendar
import org.apache.spark.sql.functions.udf

val data = Seq("1957-03-06","1959-03-06").toDF("date")

val ageUdf = udf((inputDate: String) => {
  // Parse the yyyy-MM-dd string and convert it to a LocalDate in the system time zone
  val format = new java.text.SimpleDateFormat("yyyy-MM-dd")
  val birthDate = format.parse(inputDate).toInstant.atZone(ZoneId.systemDefault()).toLocalDate
  // Today's date in the system time zone
  val currentDate = Calendar.getInstance().getTime.toInstant.atZone(ZoneId.systemDefault()).toLocalDate
  // Completed years between the two dates; 0 if either date is missing
  if ((birthDate != null) && (currentDate != null)) Period.between(birthDate, currentDate).getYears
  else 0
})

data.withColumn("age", ageUdf($"date")).show()

The output will be:

date|age
1957-03-06|61
1959-03-06|59
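The UDF above falls back to 0 when a date is missing, which is hard to tell apart from a real age of zero, and SimpleDateFormat.parse throws on a null or malformed string. A possible alternative is to keep the age logic in a plain function that returns an Option, so it can be checked without Spark. This is only a sketch under stated assumptions: the names AgeCalc and ageInYears are hypothetical, and the input is assumed to be an ISO-8601 yyyy-MM-dd string.

```scala
import java.time.{LocalDate, Period}
import scala.util.Try

object AgeCalc {
  // Hypothetical helper: completed years between `birthDate` and `asOf`.
  // Returns None for null, malformed, or future birth dates instead of 0.
  def ageInYears(birthDate: String, asOf: LocalDate): Option[Int] =
    Option(birthDate)                                        // null becomes None
      .flatMap(s => Try(LocalDate.parse(s.trim)).toOption)   // malformed becomes None
      .filter(d => !d.isAfter(asOf))                         // future dates become None
      .map(d => Period.between(d, asOf).getYears)

  def main(args: Array[String]): Unit = {
    val asOf = LocalDate.of(2019, 1, 1) // fixed reference date, for illustration only
    println(ageInYears("1957-03-06", asOf))
    println(ageInYears(null, asOf))
  }
}
```

Wrapped as udf((s: String) => AgeCalc.ageInYears(s, LocalDate.now())), a None result should surface as a null in the resulting column, since Spark's Scala UDFs accept Option return types.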
