Spark: conditionally add a column to a DataFrame

Question

I am trying to take my input data:

A    B       C
--------------
4    blah    2
2            3
56   foo     3

And add a column to the end based on whether B is empty or not:

A    B       C     D
--------------------
4    blah    2     1
2            3     0
56   foo     3     1

I can do this easily by registering the input DataFrame as a temp table and then writing a SQL query.

But I'd really like to know how to do this with just Scala methods, without having to type out a SQL query within Scala.

I've tried .withColumn, but I can't get it to do what I want.

Answer 1

Try withColumn with the function when, as follows:

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // for `toDF` and $""
import org.apache.spark.sql.functions._ // for `when`

val df = sc.parallelize(Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)))
    .toDF("A", "B", "C")

val newDf = df.withColumn("D", when($"B".isNull or $"B" === "", 0).otherwise(1))

newDf.show() shows:

+---+----+---+---+
|  A|   B|  C|  D|
+---+----+---+---+
|  4|blah|  2|  1|
|  2|    |  3|  0|
| 56| foo|  3|  1|
|100|null|  5|  0|
+---+----+---+---+

I added the (100, null, 5) row to test the isNull case.

I tried this code with Spark 1.6.0 but, as noted in the source of when, it should work on version 1.4.0 and later.
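
For readers on Spark 2.0 or later, where SparkSession is the usual entry point instead of SQLContext, the same when/otherwise approach might look like the sketch below. The local master, the app name, and the names input and result are assumptions made for illustration; they are not part of the original answer. It is written to be pasted into spark-shell or run as a Scala script.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

// Assumption: a local session for experimentation.
// In spark-shell a session already exists and getOrCreate() returns it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("conditional-column-sketch")
  .getOrCreate()
import spark.implicits._ // for `toDF` and $""

// Same rows as in the answer, including the null case.
val input = Seq[(Int, String, Int)](
  (4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)
).toDF("A", "B", "C")

// D is 0 when B is null or an empty string, and 1 otherwise.
val result = input.withColumn("D", when($"B".isNull || $"B" === "", 0).otherwise(1))

result.show()
```

Only the session setup differs from the answer; the withColumn expression itself is the same, with || used as an equivalent spelling of or.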

Answer 2

My bad, I had missed one part of the question.

The best, cleanest way is to use a UDF. The explanation is in the code comments.

// create some example data as a DataFrame
// note: the third record has an empty string
case class Stuff(a: String, b: Int)
val d = sc.parallelize(Seq(("a", 1), ("b", 2),
    ("", 3), ("d", 4)).map { x => Stuff(x._1, x._2) }).toDF

// now the good stuff.
import org.apache.spark.sql.functions.udf
// function that returns 0 if the string is empty (or null), 1 otherwise
val func = udf( (s: String) => if (s == null || s.isEmpty) 0 else 1 )
// create a new DataFrame with an added column named "notempty"
val r = d.select( $"a", $"b", func($"a").as("notempty") )

scala> r.show
+---+---+--------+
|  a|  b|notempty|
+---+---+--------+
|  a|  1|       1|
|  b|  2|       1|
|   |  3|       0|
|  d|  4|       1|
+---+---+--------+

Answer 3

How about something like this?

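// Fetch at most one row with an empty B to check whether any such row exists.
val newDF = df.filter($"B" === "").take(1) match {
  // No row has an empty B: return df unchanged (no D column is added).
  case Array() => df
  // Otherwise add D as a Boolean column that is true where B is empty.
  case _ => df.withColumn("D", $"B" === "")
}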

Using take(1) should have a minimal performance hit.
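
The snippet above yields a Boolean D that is true where B is empty, and it adds the column only when such a row exists. If the goal is the 0/1 integer column from the question on every input, one possible variation is to cast the Boolean condition to an integer, as in this sketch. It assumes Spark 2.0 or later (for SparkSession and the =!= operator); the session setup and the name flagged are hypothetical and not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("boolean-to-int-sketch")
  .getOrCreate()
import spark.implicits._ // for `toDF` and $""

val df = Seq[(Int, String, Int)](
  (4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)
).toDF("A", "B", "C")

// The condition is true only where B holds a non-empty string.
// For a null B, isNotNull is false, and `false AND null` is false in SQL logic,
// so the cast produces 0 rather than null.
val flagged = df.withColumn("D", ($"B".isNotNull && $"B" =!= "").cast("int"))

flagged.show()
```

This avoids the extra take(1) job, at the cost of always adding the column even when no B value is empty.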

Answer 4

You can use something like the code below.

// Note: windowspec is not defined in this answer; it is assumed to be a WindowSpec created elsewhere.
df.withColumn("active_status",
    when(datediff($"login_date", coalesce(lag($"login_date", 1).over(windowspec), $"login_date")) > 5,
      $"login_date")
    .otherwise(coalesce(lag($"login_date", 1).over(windowspec), $"login_date"))).show()


Answer 1
78, accepted, 2016-01-21 06:04:51

Answer 2
3, 2016-01-20 19:53:48

Answer 3
1, 2016-01-20 19:53:39

Answer 4
0, 2021-12-23 19:23:30
