Difference between columns of ArrayType in dataframe
I have a dataframe with 2 ArrayType columns, and I want to find the difference between them. column1 will always have values, while column2 may have an empty array. I created the following udf, but it is not working.
df.show() gives the following records.

Sample data:
["Test", "Test1","Test3", "Test2"], ["Test", "Test1"]
Code:
sc.udf.register("diff", (value: Column,value1: Column)=>{
value.asInstanceOf[Seq[String]].diff(value1.asInstanceOf[Seq[String]])
})
Expected output:
["Test2","Test3"]
Spark version 1.4.1. Any help will be appreciated.
column1 will always have values, while column2 may have an empty array.

Your comment: "it gives all values of value" – undefined_variable
Let's look at a small example like this...
val A = Seq(1,1)
A: Seq[Int] = List(1, 1)
val B = Seq.empty
B: Seq[Nothing] = List()
A diff B
res0: Seq[Int] = List(1, 1)
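Note that diff is a multiset difference: each element on the right cancels at most one matching occurrence on the left, and the operation is not symmetric. A quick plain-Scala sketch (no Spark needed) illustrating this:

```scala
// diff removes one occurrence from the left for each match on the right
val left  = Seq("a", "a", "b")
val right = Seq("a")

// One "a" is cancelled; the second "a" and "b" remain
println(left.diff(right))  // List(a, b)

// diff is not symmetric: nothing survives on the smaller side
println(right.diff(left))  // List()
```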
If you do a collection.SeqLike.diff, you will get the value of A, as shown in the example. As far as Scala is concerned, this is a perfectly valid case, since you said you are always getting a value (which is a Seq).
Also, the reverse case looks like this...
B diff A
res1: Seq[Nothing] = List()
If you use a Spark udf to do the above, the same results will come back:
val p = Seq("Test", "Test1","Test3", "Test2")
p: Seq[String] = List(Test, Test1, Test3, Test2)
val q = Seq("Test", "Test1")
q: Seq[String] = List(Test, Test1)
p diff q
res2: Seq[String] = List(Test3, Test2)
This is the expected output, as given in your example.
q diff p
res3: Seq[String] = List()
You need to change your udf to take the element types (Seq[String]) rather than Column:
val diff_udf = udf { (a: Seq[String], b: Seq[String]) => a diff b }
Then this works:
import org.apache.spark.sql.functions.col
df.withColumn("diff",
diff_udf(col("col1"), col("col2"))).show
+--------------------+-----------------+------------------+
| col1| col2| diff|
+--------------------+-----------------+------------------+
|List(Test, Test1,...|List(Test, Test1)|List(Test3, Test2)|
+--------------------+-----------------+------------------+
Data
val df = sc.parallelize(Seq((List("Test", "Test1","Test3", "Test2"),
List("Test", "Test1")))).toDF("col1", "col2")
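One caveat beyond the original answer (an assumption, since your sample only shows an empty array): if col2 can be null rather than just empty, the udf above will throw a NullPointerException. A hedged sketch of a null-safe helper in plain Scala, here named safeDiff (a hypothetical name):

```scala
// Hypothetical null-safe variant: treat a null sequence as empty before diffing
def safeDiff(a: Seq[String], b: Seq[String]): Seq[String] =
  Option(a).getOrElse(Seq.empty).diff(Option(b).getOrElse(Seq.empty))

println(safeDiff(Seq("Test", "Test1", "Test3", "Test2"), null))
// List(Test, Test1, Test3, Test2)
```

This could then be wrapped the same way, e.g. `udf(safeDiff _)`, instead of the bare lambda.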