Spark 1.6.2 and Scala 2.10 here.
I want to filter the spark dataframe column with an array of strings.
val df1 = sc.parallelize(Seq((1, "L-00417"), (3, "L-00645"), (4, "L-99999"),(5, "L-00623"))).toDF("c1","c2")
+---+-------+
| c1| c2|
+---+-------+
| 1|L-00417|
| 3|L-00645|
| 4|L-99999|
| 5|L-00623|
+---+-------+
val df2 = sc.parallelize(Seq((1, "L-1"), (3, "L-2"), (4, "L-3"),(5, "L-00623"))).toDF("c3","c4")
+---+-------+
| c3| c4|
+---+-------+
| 1| L-1|
| 3| L-2|
| 4| L-3|
| 5|L-00623|
+---+-------+
val c2List = df1.select("c2").as[String].collect()
df2.filter(not($"c4").contains(c2List)).show()`
I am getting below error.
Unsupported literal type class [Ljava.lang.String; [Ljava.lang.String;@5ce1739c
Can anyone please help to fix this?
First, contains
isn't suitable because you're looking for the opposite relationship - you want to check if c2List
contains c4
's value, and not the other way around.
You can use isin
for that - which uses "repeated argument" (similar to Java's "varargs") of the values to match, so you'd want to "expand" c2List
into a repeated argument, which can be done using the : _*
operator:
df2.filter(not($"c4".isin(c2List: _*)))
Alternatively , with Spark 1.6 you can use an "left anti join" , to join the two dataframes and get only values in df2
that did NOT match values in df1
:
df2.join(df1, $"c2" === $"c4", "leftanti")
Unlike the previous, this option is not limited to the case where df1
is small enough to be collected.
Lastly, if you're using earlier Spark version, you can immitate leftanti
using a left
join and a filter:
df2.join(df1, $"c2" === $"c4", "left").filter($"c2".isNull).select("c3", "c4")
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.