Difference between columns of ArrayType in dataframe
I have a dataframe with 2 ArrayType columns, and I want to find the difference between them. column1 will always have values, while column2 may have an empty array. I created the following udf, but it is not working.
df.show() gives the following records.

Sample data:
["Test", "Test1","Test3", "Test2"], ["Test", "Test1"]
Code:
sc.udf.register("diff", (value: Column,value1: Column)=>{
value.asInstanceOf[Seq[String]].diff(value1.asInstanceOf[Seq[String]])
})
Expected output:
["Test2","Test3"]
Spark version 1.4.1. Any help will be appreciated.
column1 will always have values, while column2 may have an empty array.

Your comment: "it gives all values of value" – undefined_variable
Let's look at a small example like this...
val A = Seq(1,1)
A: Seq[Int] = List(1, 1)
val B = Seq.empty
B: Seq[Nothing] = List()
A diff B
res0: Seq[Int] = List(1, 1)
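Note that diff is a multiset difference: each element on the right cancels at most one matching occurrence on the left, and the operation is not symmetric. A quick plain-Scala sketch (no Spark needed) illustrating this:

```scala
// diff removes one occurrence from the left for each match on the right
val left  = Seq("a", "a", "b")
val right = Seq("a")

// One "a" is cancelled; the second "a" and "b" remain
println(left.diff(right))  // List(a, b)

// diff is not symmetric: nothing survives on the smaller side
println(right.diff(left))  // List()
```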
If you do a collection.SeqLike.diff, you will get the value of A, as shown in the example. As far as Scala is concerned, this is a perfectly valid case, since you said you are always getting a value (which is a Seq).
Also, the reverse case looks like this...
B diff A
res1: Seq[Nothing] = List()
If you use a Spark udf to do the above, the same results will come back:
val p = Seq("Test", "Test1","Test3", "Test2")
p: Seq[String] = List(Test, Test1, Test3, Test2)
val q = Seq("Test", "Test1")
q: Seq[String] = List(Test, Test1)
p diff q
res2: Seq[String] = List(Test3, Test2)
This is the expected output, as given in your example.
q diff p
res3: Seq[String] = List()
You need to change your udf to take the element types (Seq[String]) rather than Column:
val diff_udf = udf { (a: Seq[String], b: Seq[String]) => a diff b }
Then this works:
import org.apache.spark.sql.functions.col
df.withColumn("diff",
diff_udf(col("col1"), col("col2"))).show
+--------------------+-----------------+------------------+
| col1| col2| diff|
+--------------------+-----------------+------------------+
|List(Test, Test1,...|List(Test, Test1)|List(Test3, Test2)|
+--------------------+-----------------+------------------+
Data
val df = sc.parallelize(Seq((List("Test", "Test1","Test3", "Test2"),
List("Test", "Test1")))).toDF("col1", "col2")
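One caveat beyond the original answer (an assumption, since your sample only shows an empty array): if col2 can be null rather than just empty, the udf above will throw a NullPointerException. A hedged sketch of a null-safe helper in plain Scala, here named safeDiff (a hypothetical name):

```scala
// Hypothetical null-safe variant: treat a null sequence as empty before diffing
def safeDiff(a: Seq[String], b: Seq[String]): Seq[String] =
  Option(a).getOrElse(Seq.empty).diff(Option(b).getOrElse(Seq.empty))

println(safeDiff(Seq("Test", "Test1", "Test3", "Test2"), null))
// List(Test, Test1, Test3, Test2)
```

This could then be wrapped the same way, e.g. `udf(safeDiff _)`, instead of the bare lambda.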