Adding a new column to a DataFrame by using the values of multiple other columns in the DataFrame - Spark/Scala
I am new to Spark SQL and DataFrames. I have a DataFrame to which I should add a new column based on the values of other columns. I have a nested IF formula from Excel that I should implement (for adding values to the new column), which, when converted into programmatic terms, looks something like this:
if (k == 'yes') {
  if (!(i == '')) {
    if (diff(max_date, target_date) < 0) {
      if (j == '') {
        "pending"   // the value of the column
      } else {
        "approved"  // the value of the column
      }
    } else {
      "expired"     // the value of the column
    }
  } else {
    ""              // the value should be empty
  }
} else {
  ""                // the value should be empty
}
i, j, and k are three other columns in the DataFrame.
I know we can use withColumn and when to add new columns based on other columns, but I am not sure how I can achieve the above logic using that approach.
What would be an easy/efficient way to implement the above logic for adding the new column? Any help would be appreciated.
Thank you.
First thing, let's simplify that if statement:
if (k == "yes" && i.nonEmpty)
  if (maxDate - targetDate < 0)
    if (j.isEmpty) "pending"
    else "approved"
  else "expired"
else ""
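As a quick sanity check (plain Scala, outside Spark, with the date difference passed in as a number standing in for diff(max_date, target_date)), the simplified expression produces the same four outcomes as the original nested IF:

```scala
// Plain-Scala sketch of the simplified condition; `diff` stands in for
// diff(max_date, target_date) from the original Excel formula.
def state(i: String, j: String, k: String, diff: Long): String =
  if (k == "yes" && i.nonEmpty)
    if (diff < 0)
      if (j.isEmpty) "pending" else "approved"
    else "expired"
  else ""

// The four branches of the original formula:
// state("x", "",  "yes", -1) -> "pending"
// state("x", "y", "yes", -1) -> "approved"
// state("x", "",  "yes",  1) -> "expired"
// state("",  "",  "yes", -1) -> ""  (and likewise when k != "yes")
```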
Now there are two main ways to accomplish this:

1. A custom udf
2. coalesce, when, otherwise
Now, due to the complexity of your conditions, number 2 will be rather tricky. A custom UDF should suit your needs:
import org.apache.spark.sql.functions.{udf, lit}
import spark.implicits._  // for the $"column" syntax

def getState(i: String, j: String, k: String, maxDate: Long, targetDate: Long): String =
  if (k == "yes" && i.nonEmpty)
    if (maxDate - targetDate < 0)
      if (j.isEmpty) "pending"
      else "approved"
    else "expired"
  else ""

val stateUdf = udf(getState _)
df.withColumn("state", stateUdf($"i", $"j", $"k", lit(0), lit(0)))
Just change lit(0) and lit(0) to your date code, and this should work for you.
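For example, if the two dates live in timestamp columns of the DataFrame (here assumed to be named max_date and target_date; adjust to your actual schema), the placeholder lits could be replaced by converting each column to epoch seconds:

```scala
import org.apache.spark.sql.functions.unix_timestamp

// max_date and target_date are hypothetical column names; unix_timestamp
// converts a timestamp column to epoch seconds (LongType), matching the
// Long parameters of getState.
df.withColumn("state",
  stateUdf($"i", $"j", $"k",
    unix_timestamp($"max_date"), unix_timestamp($"target_date")))
```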
If you notice performance issues, you can switch to using coalesce, when, and otherwise, which would look something like this:
val isApproved = df.withColumn("state",
  when($"k" === "yes" && $"i" =!= "" && (lit(max_date) - lit(target_date) < 0) && $"j" =!= "", "approved")
    .otherwise(null))
val isPending = isApproved.withColumn("state",
  coalesce($"state",
    when($"k" === "yes" && $"i" =!= "" && (lit(max_date) - lit(target_date) < 0) && $"j" === "", "pending")
      .otherwise(null)))
val isExpired = isPending.withColumn("state",
  coalesce($"state",
    when($"k" === "yes" && $"i" =!= "" && (lit(max_date) - lit(target_date) >= 0), "expired")
      .otherwise(null)))
val finalDf = isExpired.withColumn("state", coalesce($"state", lit("")))
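Since when calls can be chained before the final otherwise, the four withColumn steps above can also be collapsed into a single expression that mirrors the branch order of the original formula (a sketch, keeping the same placeholder date arithmetic):

```scala
import org.apache.spark.sql.functions.{when, lit}

// lit(max_date) - lit(target_date) is a placeholder, exactly as above;
// substitute your real date logic. The conditions are checked in order,
// so each branch only fires when the earlier ones did not match.
val finalDf = df.withColumn("state",
  when($"k" =!= "yes" || $"i" === "", "")
    .when(lit(max_date) - lit(target_date) >= 0, "expired")
    .when($"j" === "", "pending")
    .otherwise("approved"))
```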
I've used custom UDFs in the past with large input sources without issues, and custom UDFs can lead to much more readable code, especially in this case.