Adding a new column to a Dataframe by using the values of multiple other columns in the dataframe - spark/scala

I am new to Spark SQL and Dataframes. I have a Dataframe to which I should be adding a new column based on the values of other columns. I have a nested IF formula from Excel that I should be implementing (for adding values to the new column), which, when converted into programmatic terms, is something like this:

if(k =='yes')
{
  if(!(i==''))
  {
    if(diff(max_date, target_date) < 0)
    {
      if(j == '')
      {
        "pending" //the value of the column
      }
      else {
        "approved" //the value of the column
      }
    }
    else{
      "expired" //the value of the column
    }
  }
  else{
    "" //the value should be empty
  }
}
else{
  "" //the value should be empty
} 

i, j, k are three other columns in the Dataframe. I know we can use withColumn and when to add new columns based on other columns, but I am not sure how I can achieve the above logic using that approach.

What would be an easy/efficient way to implement the above logic for adding the new column? Any help would be appreciated.

Thank you.

First, let's simplify that if statement:

if(k == "yes" && i.nonEmpty)
  if(maxDate - targetDate < 0)
    if (j.isEmpty) "pending" 
    else "approved"
  else "expired"
else ""

Now there are two main ways to accomplish this:

  1. Using a custom UDF
  2. Using Spark built-in functions: coalesce, when, otherwise

Custom UDF

Now, due to the complexity of your conditions, it will be rather tricky to do number 2. Using a custom UDF should suit your needs.

import org.apache.spark.sql.functions.{lit, udf}
import spark.implicits._ // for the $"colName" column syntax

// Plain Scala function that encodes the nested-if logic
def getState(i: String, j: String, k: String, maxDate: Long, targetDate: Long): String =
  if (k == "yes" && i.nonEmpty)
    if (maxDate - targetDate < 0)
      if (j.isEmpty) "pending"
    else "approved"
    else "expired"
  else ""

// Wrap the function as a UDF and apply it to the columns
val stateUdf = udf(getState _)
df.withColumn("state", stateUdf($"i", $"j", $"k", lit(0), lit(0)))

Just change the two lit(0) placeholders to your date code, and this should work for you.
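For instance (a minimal sketch, assuming your frame has max_date and target_date columns of a date or timestamp type; those column names are placeholders, so adjust them to yours), you could feed epoch seconds into the UDF:

import org.apache.spark.sql.functions.unix_timestamp

// unix_timestamp converts a date/timestamp column to epoch seconds (LongType),
// which matches getState's Long parameters
df.withColumn("state",
  stateUdf($"i", $"j", $"k", unix_timestamp($"max_date"), unix_timestamp($"target_date")))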

Using Spark built-in functions

If you notice performance issues, you can switch to using coalesce, otherwise, and when, which would look something like this:

import org.apache.spark.sql.functions.{coalesce, lit, when}

// Each step fills in one state; coalesce keeps the first non-null value found
val isApproved = df.withColumn("state",
  when($"k" === "yes" && $"i" =!= "" && (lit(max_date) - lit(target_date) < 0) && $"j" =!= "", "approved").otherwise(null))
val isPending = isApproved.withColumn("state", coalesce($"state",
  when($"k" === "yes" && $"i" =!= "" && (lit(max_date) - lit(target_date) < 0) && $"j" === "", "pending").otherwise(null)))
val isExpired = isPending.withColumn("state", coalesce($"state",
  when($"k" === "yes" && $"i" =!= "" && (lit(max_date) - lit(target_date) >= 0), "expired").otherwise(null)))
val finalDf = isExpired.withColumn("state", coalesce($"state", lit("")))
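As a side note, the whole cascade can also be collapsed into a single nested when expression (a sketch under the same assumptions about max_date and target_date), which avoids repeating the shared predicates across the four steps:

import org.apache.spark.sql.functions.{lit, when}

// One nested expression instead of a coalesce chain; the k/i guard is written once.
// finalDfAlt is a hypothetical name, chosen to avoid clashing with finalDf above.
val finalDfAlt = df.withColumn("state",
  when($"k" === "yes" && $"i" =!= "",
    when(lit(max_date) - lit(target_date) < 0,
      when($"j" === "", "pending").otherwise("approved"))
      .otherwise("expired"))
    .otherwise(""))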

I've used custom UDFs in the past with large input sources without issues, and custom UDFs can lead to much more readable code, especially in this case.
