I have a DataFrame with two ArrayType columns, and I want to find the difference between them. column1 will always have values, while column2 may be an empty array. I created the following UDF, but it is not working.
df.show() gives the following record.
Sample data:
["Test", "Test1","Test3", "Test2"], ["Test", "Test1"]
Code:
sc.udf.register("diff", (value: Column,value1: Column)=>{
value.asInstanceOf[Seq[String]].diff(value1.asInstanceOf[Seq[String]])
})
Expected output:
["Test2","Test3"]
The Spark version is 1.4.1. Any help will be appreciated.
You said that column1 will always have values while column2 may be an empty array, and in your comment you wrote: "it gives all values of value" (undefined_variable).
Let's look at a small example:
val A = Seq(1,1)
A: Seq[Int] = List(1, 1)
val B = Seq.empty
B: Seq[Nothing] = List()
A diff B
res0: Seq[Int] = List(1, 1)
If you call collection.SeqLike.diff with an empty second sequence, you get A back unchanged, as the example shows. In Scala this is a perfectly valid case: you said the first column always holds a value, which is a Seq, and there is nothing to remove from it.
The reverse case looks like this:
B diff A
res1: Seq[Nothing] = List()
If you do the same thing inside a Spark UDF, you get the same results. With your sample data:
val p = Seq("Test", "Test1","Test3", "Test2")
p: Seq[String] = List(Test, Test1, Test3, Test2)
val q = Seq("Test", "Test1")
q: Seq[String] = List(Test, Test1)
p diff q
res2: Seq[String] = List(Test3, Test2)
This is the expected output given in your example. The reverse gives an empty list:
q diff p
res3: Seq[String] = List()
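One detail the transcripts above do not show is how diff treats duplicates: it is a multiset difference, so each element of the second sequence cancels at most one occurrence in the first. The sketch below contrasts that with a set-like removal. The object name DiffSemantics and the helper names multisetDiff and removeAll are made up for illustration and are not part of Scala or Spark.

```scala
object DiffSemantics {
  // Multiset difference: each element of b cancels at most one occurrence in a.
  def multisetDiff(a: Seq[String], b: Seq[String]): Seq[String] = a diff b

  // Set-like difference: every occurrence of an element found in b is dropped.
  def removeAll(a: Seq[String], b: Seq[String]): Seq[String] = {
    val exclude = b.toSet
    a.filterNot(exclude)
  }
}

// DiffSemantics.multisetDiff(Seq("Test", "Test", "Test1"), Seq("Test"))  // List(Test, Test1)
// DiffSemantics.removeAll(Seq("Test", "Test", "Test1"), Seq("Test"))     // List(Test1)
```

If your arrays never contain repeated values, both helpers return the same result and plain diff is enough.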
You need to change your UDF so that it takes Seq[String] arguments instead of Column:
import org.apache.spark.sql.functions.udf
val diff_udf = udf { (a: Seq[String], b: Seq[String]) => a diff b }
Then this works:
import org.apache.spark.sql.functions.col
df.withColumn("diff",
diff_udf(col("col1"), col("col2"))).show
+--------------------+-----------------+------------------+
| col1| col2| diff|
+--------------------+-----------------+------------------+
|List(Test, Test1,...|List(Test, Test1)|List(Test3, Test2)|
+--------------------+-----------------+------------------+
Data
val df = sc.parallelize(Seq((List("Test", "Test1","Test3", "Test2"),
List("Test", "Test1")))).toDF("col1", "col2")
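An empty array in col2 is handled by diff as shown earlier, but a null array is a different case: calling diff on or with null throws a NullPointerException. If your data may contain nulls (an assumption, since the question only mentions empty arrays), a null-tolerant function is one option. The names NullSafeDiff and safeDiff below are hypothetical, and this is only a minimal sketch in plain Scala.

```scala
object NullSafeDiff {
  // Hypothetical helper: treat a null array as an empty one before diffing.
  def safeDiff(a: Seq[String], b: Seq[String]): Seq[String] = {
    val left  = Option(a).getOrElse(Seq.empty[String])   // null -> empty Seq
    val right = Option(b).getOrElse(Seq.empty[String])   // null -> empty Seq
    left diff right
  }
}

// NullSafeDiff.safeDiff(Seq("Test", "Test1"), null)  // List(Test, Test1)
```

It could then be wrapped the same way as the UDF in this answer, for example with udf(NullSafeDiff.safeDiff _), assuming that a null array column reaches the function as a null Seq.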