
Scala - Apply a function to each value in a dataframe column

I have a function that takes a LocalDate (it could take any other type) and returns a DataFrame, e.g.:

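import java.time.LocalDate
import org.apache.spark.sql.DataFrame
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark

def genDataFrame(refDate: LocalDate): DataFrame = {
  Seq(
    (refDate, refDate.minusDays(7)),
    (refDate.plusDays(3), refDate.plusDays(7))
  ).toDF("col_A", "col_B")
}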

Output of genDataFrame(LocalDate.parse("2021-07-02")):

+----------+----------+
|     col_A|     col_B|
+----------+----------+
|2021-07-02|2021-06-25|
|2021-07-05|2021-07-09|
+----------+----------+

I want to apply this function to each element of a dataframe column (which contains LocalDate values), such as:

val myDate = LocalDate.parse("2021-07-02")

val df = Seq(
  (myDate),
  (myDate.plusDays(1)),
  (myDate.plusDays(3))
).toDF("date")

df:

+----------+
|      date|
+----------+
|2021-07-02|
|2021-07-03|
|2021-07-05|
+----------+

Required output:

+----------+----------+
|     col_A|     col_B|
+----------+----------+
|2021-07-02|2021-06-25|
|2021-07-05|2021-07-09|
|2021-07-03|2021-06-26|
|2021-07-06|2021-07-10|
|2021-07-05|2021-06-28|
|2021-07-08|2021-07-12|
+----------+----------+

How could I achieve that (without using collect)?

You can always convert your data frame to a lazily evaluated view and use Spark SQL:

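import org.apache.spark.sql.functions.col
import spark.implicits._ // encoders required by map

// Copy the date column into col_A and col_B, then register the result as a view
val df_2 = df.map(x => x.getDate(0).toLocalDate()).withColumnRenamed("value", "col_A")
  .withColumn("col_B", col("col_A"))
df_2.createOrReplaceTempView("test")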

This creates a view like this one:

+----------+----------+
|     col_A|     col_B|
+----------+----------+
|2021-07-02|2021-07-02|
|2021-07-03|2021-07-03|
|2021-07-05|2021-07-05|
+----------+----------+

You can then use SQL, which I find more intuitive:

spark.sql("""SELECT col_A, date_add(col_B, -7) AS col_B FROM test
UNION
SELECT date_add(col_A, 3), date_add(col_B, 7) AS col_B FROM test""")
  .show()

This gives the rows of your expected output as a DataFrame (the row order may differ). Note that UNION removes duplicate rows; use UNION ALL if duplicates should be kept:

+----------+----------+
|     col_A|     col_B|
+----------+----------+
|2021-07-02|2021-06-25|
|2021-07-03|2021-06-26|
|2021-07-05|2021-06-28|
|2021-07-05|2021-07-09|
|2021-07-06|2021-07-10|
|2021-07-08|2021-07-12|
+----------+----------+
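
If you would rather stay in Scala than go through SQL, a typed flatMap is another possible route. The sketch below is not part of the approach above. It assumes Spark 3.x, where spark.implicits._ provides an encoder for java.time.LocalDate. Since genDataFrame returns a DataFrame, it cannot be called inside a transformation running on the executors, so the sketch rewrites its body as a plain function that returns the rows. The names FlatMapDates and expand are made up for illustration.

```scala
import java.time.LocalDate
import org.apache.spark.sql.{DataFrame, SparkSession}

object FlatMapDates {
  // Hypothetical helper: the two rows genDataFrame would produce for one date,
  // returned as plain tuples instead of a DataFrame.
  def expand(refDate: LocalDate): Seq[(LocalDate, LocalDate)] = Seq(
    (refDate, refDate.minusDays(7)),
    (refDate.plusDays(3), refDate.plusDays(7))
  )

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for demonstration; reuse your own session in practice.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatMapDates")
      .getOrCreate()
    import spark.implicits._

    val myDate = LocalDate.parse("2021-07-02")
    val df = Seq(myDate, myDate.plusDays(1), myDate.plusDays(3)).toDF("date")

    // Read the single date column as Dataset[LocalDate], emit two rows per date,
    // then name the tuple columns. No collect is involved.
    val result: DataFrame = df
      .as[LocalDate]
      .flatMap(d => expand(d))
      .toDF("col_A", "col_B")

    result.show()
    spark.stop()
  }
}
```

Because flatMap emits the rows for each input date together, duplicates are kept, unlike with UNION. The trade-off is that the lambda is opaque to the Catalyst optimizer, so the SQL version may be preferable when the logic can be expressed with built-in date functions.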
