Adding a new column to a DataFrame by using the values of multiple other columns in the DataFrame - Spark/Scala
I am new to Spark SQL and DataFrames. I have a DataFrame to which I should add a new column based on the values of other columns. I have a nested IF formula from Excel that I should implement (for adding values to the new column), which, when converted into programmatic terms, looks something like this:
if (k == 'yes') {
  if (!(i == '')) {
    if (diff(max_date, target_date) < 0) {
      if (j == '') {
        "pending"   // the value of the column
      } else {
        "approved"  // the value of the column
      }
    } else {
      "expired"     // the value of the column
    }
  } else {
    ""              // the value should be empty
  }
} else {
  ""                // the value should be empty
}
i, j, and k are three other columns in the DataFrame.
I know we can use withColumn and when to add new columns based on other columns, but I am not sure how I can achieve the above logic using that approach.
What would be an easy/efficient way to implement the above logic for adding the new column? Any help would be appreciated.
Thank you.
First thing, let's simplify that if statement:
if (k == "yes" && i.nonEmpty)
  if (maxDate - targetDate < 0)
    if (j.isEmpty) "pending"
    else "approved"
  else "expired"
else ""
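As a quick sanity check (plain Scala, outside Spark, with the date difference passed in as a number standing in for diff(max_date, target_date)), the simplified expression produces the same four outcomes as the original nested IF:

```scala
// Plain-Scala sketch of the simplified condition; `diff` stands in for
// diff(max_date, target_date) from the original Excel formula.
def state(i: String, j: String, k: String, diff: Long): String =
  if (k == "yes" && i.nonEmpty)
    if (diff < 0)
      if (j.isEmpty) "pending" else "approved"
    else "expired"
  else ""

// The four branches of the original formula:
// state("x", "",  "yes", -1) -> "pending"
// state("x", "y", "yes", -1) -> "approved"
// state("x", "",  "yes",  1) -> "expired"
// state("",  "",  "yes", -1) -> ""  (and likewise when k != "yes")
```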
Now there are two main ways to accomplish this:

1. A custom udf
2. coalesce, when, otherwise
Now, due to the complexity of your conditions, number 2 will be rather tricky. A custom UDF should suit your needs:
import org.apache.spark.sql.functions.{udf, lit}
import spark.implicits._  // for the $"column" syntax

def getState(i: String, j: String, k: String, maxDate: Long, targetDate: Long): String =
  if (k == "yes" && i.nonEmpty)
    if (maxDate - targetDate < 0)
      if (j.isEmpty) "pending"
      else "approved"
    else "expired"
  else ""

val stateUdf = udf(getState _)
df.withColumn("state", stateUdf($"i", $"j", $"k", lit(0), lit(0)))
Just change lit(0) and lit(0) to your date code, and this should work for you.
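For example, if the two dates live in timestamp columns of the DataFrame (here assumed to be named max_date and target_date; adjust to your actual schema), the placeholder lits could be replaced by converting each column to epoch seconds:

```scala
import org.apache.spark.sql.functions.unix_timestamp

// max_date and target_date are hypothetical column names; unix_timestamp
// converts a timestamp column to epoch seconds (LongType), matching the
// Long parameters of getState.
df.withColumn("state",
  stateUdf($"i", $"j", $"k",
    unix_timestamp($"max_date"), unix_timestamp($"target_date")))
```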
If you notice performance issues, you can switch to using coalesce, when, and otherwise, which would look something like this:
val isApproved = df.withColumn("state",
  when($"k" === "yes" && $"i" =!= "" && (lit(max_date) - lit(target_date) < 0) && $"j" =!= "", "approved")
    .otherwise(null))
val isPending = isApproved.withColumn("state",
  coalesce($"state",
    when($"k" === "yes" && $"i" =!= "" && (lit(max_date) - lit(target_date) < 0) && $"j" === "", "pending")
      .otherwise(null)))
val isExpired = isPending.withColumn("state",
  coalesce($"state",
    when($"k" === "yes" && $"i" =!= "" && (lit(max_date) - lit(target_date) >= 0), "expired")
      .otherwise(null)))
val finalDf = isExpired.withColumn("state", coalesce($"state", lit("")))
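Since when calls can be chained before the final otherwise, the four withColumn steps above can also be collapsed into a single expression that mirrors the branch order of the original formula (a sketch, keeping the same placeholder date arithmetic):

```scala
import org.apache.spark.sql.functions.{when, lit}

// lit(max_date) - lit(target_date) is a placeholder, exactly as above;
// substitute your real date logic. The conditions are checked in order,
// so each branch only fires when the earlier ones did not match.
val finalDf = df.withColumn("state",
  when($"k" =!= "yes" || $"i" === "", "")
    .when(lit(max_date) - lit(target_date) >= 0, "expired")
    .when($"j" === "", "pending")
    .otherwise("approved"))
```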
I've used custom UDFs in the past with large input sources without issues, and custom UDFs can lead to much more readable code, especially in this case.