
Difference between columns of ArrayType in dataframe

I have a DataFrame with two ArrayType columns and I want to find the difference between them. column1 will always have values, while column2 may be an empty array. I created the following UDF, but it is not working.

df.show() gives the following records.

SampleData:

["Test", "Test1","Test3", "Test2"], ["Test", "Test1"]

Code:

sc.udf.register("diff", (value: Column,value1: Column)=>{ 
                        value.asInstanceOf[Seq[String]].diff(value1.asInstanceOf[Seq[String]])          
                    })  

Expected output:

["Test2","Test3"]

Spark version: 1.4.1. Any help will be appreciated.

From your question: "column1 will always have values while column2 may have empty array."

From your comment: "it gives all values of value" – undefined_variable

Example 1:

Let's look at a small example:

   val A = Seq(1,1)

 A: Seq[Int] = List(1, 1)

 val B = Seq.empty

 B: Seq[Nothing] = List()

A diff B

 res0: Seq[Int] = List(1, 1)

If you call collection.SeqLike.diff with an empty second sequence, you get A back unchanged, as the example shows. In Scala this is perfectly valid behaviour, and since you said the first column always holds values, the result is simply that column's sequence.

The reverse case looks like this:

 B diff A

 res1: Seq[Nothing] = List()

If you do the same thing inside a Spark UDF, you will get the same results.
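One detail that the examples above only hint at: diff on a Scala Seq is a multiset difference. Each element of the right-hand sequence removes at most one matching occurrence from the left-hand one, and the remaining elements keep their original order. A minimal sketch in plain Scala (the object name DiffSemantics and the sample values are made up for illustration, they are not from the question):

```scala
object DiffSemantics {
  // Multiset difference: only one of the two 1s is removed.
  val withDuplicates: Seq[Int] = Seq(1, 1, 2) diff Seq(1)

  // An empty right-hand side removes nothing.
  val againstEmpty: Seq[Int] = Seq(1, 1) diff Seq.empty[Int]

  // An empty left-hand side always yields an empty result.
  val fromEmpty: Seq[Int] = Seq.empty[Int] diff Seq(1, 1)

  def main(args: Array[String]): Unit = {
    println(withDuplicates)
    println(againstEmpty)
    println(fromEmpty)
  }
}
```

If you need set semantics instead (every occurrence removed), filtering with filterNot against the right-hand side converted to a Set is one possible alternative.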

EDIT: (the case where neither array is empty, as in your modified example)

Example 2:

 val p = Seq("Test", "Test1","Test3", "Test2")

 p: Seq[String] = List(Test, Test1, Test3, Test2)

 val q = Seq("Test", "Test1")

 q: Seq[String] = List(Test, Test1)

 p diff q

 res2: Seq[String] = List(Test3, Test2)

This is your expected output, and it matches the example you gave.

Reverse case: I think this is what you are actually getting, which is not what you expect.

q diff p

 res3: Seq[String] = List()

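You need to change your UDF so that its parameters are typed as Seq[String] rather than Column (a UDF receives the column values, not Column objects):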

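import org.apache.spark.sql.functions.udf

// The function receives the array values of each row as Seq[String].
val diff_udf = udf { (a: Seq[String], b: Seq[String]) => a diff b }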

Then this works:

import org.apache.spark.sql.functions.col
df.withColumn("diff",
  diff_udf(col("col1"), col("col2"))).show
+--------------------+-----------------+------------------+
|                col1|             col2|              diff|
+--------------------+-----------------+------------------+
|List(Test, Test1,...|List(Test, Test1)|List(Test3, Test2)|
+--------------------+-----------------+------------------+
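If col2 can be null rather than just an empty array (an assumption; the question only mentions empty arrays), the function passed to udf would throw a NullPointerException on a diff b. A hedged sketch of a null-tolerant version in plain Scala follows; SafeDiff and safeDiff are hypothetical names, and the Spark wrapping is left out so the block runs without Spark on the classpath:

```scala
object SafeDiff {
  // Treat a null array as empty before taking the difference.
  def safeDiff(a: Seq[String], b: Seq[String]): Seq[String] = {
    val left  = Option(a).getOrElse(Seq.empty[String])
    val right = Option(b).getOrElse(Seq.empty[String])
    left diff right
  }

  def main(args: Array[String]): Unit = {
    // Same sample data as in the question.
    println(safeDiff(Seq("Test", "Test1", "Test3", "Test2"), Seq("Test", "Test1")))
    // A null second argument leaves the first sequence unchanged.
    println(safeDiff(Seq("Test"), null))
  }
}
```

With Spark available, this could presumably be wrapped the same way as diff_udf, for example udf(SafeDiff.safeDiff _).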

Data

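// toDF needs the SQLContext implicits; spark-shell imports them automatically.
// Assumes the SQLContext instance is named sqlContext, as in spark-shell.
import sqlContext.implicits._
val df = sc.parallelize(Seq((List("Test", "Test1", "Test3", "Test2"),
                             List("Test", "Test1")))).toDF("col1", "col2")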
