Convert Date Column to Age with Scala and Spark
I'm trying to convert a column of my dataset into the person's actual age. I'm using Scala with Spark, and my project is in IntelliJ.
Here is the sample dataset:
TotalCost|BirthDate|Gender|TotalChildren|ProductCategoryName
1000||Male|2|Technology
2000|1957-03-06||3|Beauty
3000|1959-03-06|Male||Car
4000|1953-03-06|Male|2|
5000|1957-03-06|Female|3|Beauty
6000|1959-03-06|Male|4|Car
7000|1957-03-06|Female|3|Beauty
8000|1959-03-06|Male|4|Car
Here is the Scala code:
import org.apache.spark.sql.SparkSession

object DataFrameFromCSVFile2 {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExample")
      .getOrCreate()

    val filePath = "src/main/resources/demodata.txt"
    // Read the pipe-delimited file with a header row and let Spark infer the column types
    val df = spark.read
      .options(Map("inferSchema" -> "true", "delimiter" -> "|", "header" -> "true"))
      .csv(filePath)
      .select("Gender", "BirthDate", "TotalCost", "TotalChildren", "ProductCategoryName")

    // Keep only the rows where none of these columns is null
    val df2 = df
      .filter("Gender is not null")
      .filter("BirthDate is not null")
      .filter("TotalChildren is not null")
      .filter("ProductCategoryName is not null")
    df2.show()
  }
}
So I'm trying to turn a value such as 1957-03-06 in the column into an age such as 61.
Any ideas would be a great help.
Thank you very much.
You can use the built-in functions months_between() or datediff(). Take a look at this:
scala> val df = Seq("1957-03-06","1959-03-06").toDF("date")
df: org.apache.spark.sql.DataFrame = [date: string]
scala> df.show(false)
+----------+
|date |
+----------+
|1957-03-06|
|1959-03-06|
+----------+
scala> df.withColumn("age",months_between(current_date,'date)/12).show
+----------+------------------+
| date| age|
+----------+------------------+
|1957-03-06|61.806451612500005|
|1959-03-06|59.806451612500005|
+----------+------------------+
scala> df.withColumn("age",datediff(current_date,'date)/365).show
+----------+-----------------+
| date| age|
+----------+-----------------+
|1957-03-06|61.85205479452055|
|1959-03-06|59.85205479452055|
+----------+-----------------+
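Both expressions above return a fractional age, and dividing a day count by 365 ignores leap days, so truncating that result can be off by one close to a birthday. A minimal sketch of the drift, using plain java.time rather than Spark; the object name, the helper names, and the two dates are hypothetical and chosen only for illustration:

```scala
import java.time.{LocalDate, Period}
import java.time.temporal.ChronoUnit

object AgeApproximation {
  // Age in whole calendar years: only counts a year once the birthday is reached.
  def calendarYears(birth: LocalDate, asOf: LocalDate): Int =
    Period.between(birth, asOf).getYears

  // Elapsed days divided by 365 and truncated, mirroring datediff(...)/365.
  def daysOver365(birth: LocalDate, asOf: LocalDate): Int =
    (ChronoUnit.DAYS.between(birth, asOf) / 365).toInt

  def main(args: Array[String]): Unit = {
    val birth = LocalDate.of(2000, 3, 6)
    val asOf  = LocalDate.of(2020, 3, 1) // five days before the 20th birthday
    println(calendarYears(birth, asOf))  // 19
    println(daysOver365(birth, asOf))    // 20, because five leap days add up
  }
}
```

If whole years are needed inside Spark itself, wrapping the months-based expression as floor(months_between(current_date, 'date) / 12) and casting it to an integer is one option that generally tracks calendar years more closely than the 365-day division, though it is worth checking edge cases such as end-of-month birthdays on your own data.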
Here is one way to do it, using the java.time API in a UDF together with Spark's built-in when/otherwise for the null check:
val currentAge = udf{ (dob: java.sql.Date) =>
import java.time.{LocalDate, Period}
Period.between(dob.toLocalDate, LocalDate.now).getYears
}
df.withColumn("CurrentAge", when($"BirthDate".isNotNull, currentAge($"BirthDate"))).
show(5)
// +------+-------------------+---------+-------------+-------------------+----------+
// |Gender| BirthDate|TotalCost|TotalChildren|ProductCategoryName|CurrentAge|
// +------+-------------------+---------+-------------+-------------------+----------+
// | Male| null| 1000| 2| Technology| null|
// | null|1957-03-06 00:00:00| 2000| 3| Beauty| 61|
// | Male|1959-03-06 00:00:00| 3000| null| Car| 59|
// | Male|1953-03-06 00:00:00| 4000| 2| null| 65|
// |Female|1957-03-06 00:00:00| 5000| 3| Beauty| 61|
// +------+-------------------+---------+-------------+-------------------+----------+
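The null handling can also live inside the age function itself, and passing the reference date as a parameter instead of calling LocalDate.now makes the logic easy to check outside Spark. A minimal sketch under those assumptions; NullSafeAge and ageOn are hypothetical names, and the reference date is fixed only so the result is reproducible:

```scala
import java.sql.Date
import java.time.{LocalDate, Period}

object NullSafeAge {
  // Returns None for a missing birth date instead of throwing a NullPointerException.
  def ageOn(dob: Date, asOf: LocalDate): Option[Int] =
    Option(dob).map(d => Period.between(d.toLocalDate, asOf).getYears)

  def main(args: Array[String]): Unit = {
    val asOf = LocalDate.of(2019, 1, 1) // fixed reference date, for illustration only
    println(ageOn(Date.valueOf("1957-03-06"), asOf)) // Some(61)
    println(ageOn(null, asOf))                       // None
  }
}
```

A function of this shape could then be wrapped with udf, for example udf((d: Date) => NullSafeAge.ageOn(d, LocalDate.now)). Spark's Scala UDFs are generally able to return an Option, which shows up as a nullable column, but treat that as something to confirm on your Spark version.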
You can use the Java Calendar library to get the current date in your time zone and compute the age from it. You can do this with a UDF. For example:
import java.time.{Period, ZoneId}
import java.util.Calendar
import org.apache.spark.sql.functions.udf

// Assumes spark-shell, where spark.implicits._ (toDF and $"...") is already in scope
val data = Seq("1957-03-06", "1959-03-06").toDF("date")

val ageUdf = udf((inputDate: String) => {
  val format = new java.text.SimpleDateFormat("yyyy-MM-dd")
  // Parse the birth date and convert it to a LocalDate in the system time zone
  val birthDate = format.parse(inputDate).toInstant.atZone(ZoneId.systemDefault()).toLocalDate
  // Today's date in the system time zone
  val currentDate = Calendar.getInstance().getTime.toInstant.atZone(ZoneId.systemDefault()).toLocalDate
  if ((birthDate != null) && (currentDate != null)) Period.between(birthDate, currentDate).getYears
  else 0
})

data.withColumn("age", ageUdf($"date")).show()
The output will be:
date|age
1957-03-06|61
1959-03-06|59
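Since the dates are already in ISO yyyy-MM-dd form, the string can also be parsed directly with java.time, which avoids the SimpleDateFormat and Calendar round trip and gives a natural place to handle null or malformed input. A minimal sketch, assuming a hypothetical helper named ageFromString and a fixed reference date so the results are reproducible:

```scala
import java.time.{LocalDate, Period}
import scala.util.Try

object AgeFromString {
  // Parses an ISO yyyy-MM-dd string; yields None for null, empty, or malformed input.
  def ageFromString(input: String, asOf: LocalDate): Option[Int] =
    Try(LocalDate.parse(input)).toOption
      .map(birth => Period.between(birth, asOf).getYears)

  def main(args: Array[String]): Unit = {
    val asOf = LocalDate.of(2019, 1, 1) // fixed reference date, for illustration only
    println(ageFromString("1957-03-06", asOf)) // Some(61)
    println(ageFromString("1959-03-06", asOf)) // Some(59)
    println(ageFromString("", asOf))           // None
  }
}
```

Returning None rather than 0 keeps a missing birth date distinguishable from a newborn, which matters if the age column is later aggregated.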