I'm trying to implement the following simple query in Flink's DataSet API.
select t1_value1
from table1
where t1_suppkey not in (
    select t2_suppkey
    from table2
)
So my idea was to perform a left outer join (table1.leftOuterJoin(table2)...) and then drop every row that has a value for both t1_suppkey and t2_suppkey.
So I tried it like this:
output = table1
    .leftOuterJoin(table2).where("t1_suppkey").equalTo("t2_suppkey")
    .with((Table1 t1, Table2 t2) -> new Tuple2<>(t1.ps_suppkey, t2.s_suppkey))
    .returns(new TypeHint<Tuple2<Integer, Integer>>() {});
However, this always fails with a "java.lang.NullPointerException" and I'm not sure why. If I use a normal join instead of a left outer join the code works, but that's not what I want.
Do I need to implement the left outer join differently, or is there a simpler way to rewrite the "not in" statement in the DataSet API?
The outer join of the DataSet API also calls the JoinFunction for outer records that don't find a joining record on the inner side. In this case, the JoinFunction.join() method is called with null for the missing side.
Since you are using a LEFT OUTER JOIN, the second argument Table2 t2 can be null. The NullPointerException is caused by t2.s_suppkey. You need to check for t2 == null and only access t2 if it is not null.
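To make the null behaviour concrete, here is a minimal sketch in plain Java that does not use Flink at all. It only mimics what is described above: the join function is called once per matching pair, and once with null on the right side when a left record has no match. The class LeftOuterJoinSketch, its nested Table1 and Table2 classes, and the helper methods are hypothetical stand-ins for illustration, not Flink API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public class LeftOuterJoinSketch {

    // Hypothetical stand-ins for the POJOs in the question.
    static final class Table1 {
        final int t1_suppkey;
        final String t1_value1;
        Table1(int t1_suppkey, String t1_value1) {
            this.t1_suppkey = t1_suppkey;
            this.t1_value1 = t1_value1;
        }
    }

    static final class Table2 {
        final int t2_suppkey;
        Table2(int t2_suppkey) {
            this.t2_suppkey = t2_suppkey;
        }
    }

    // Mimics left outer join semantics: fn is called for every matching pair,
    // and once with a null right side for left records without a match.
    static <R> List<R> leftOuterJoin(List<Table1> left, List<Table2> right,
                                     BiFunction<Table1, Table2, R> fn) {
        List<R> out = new ArrayList<>();
        for (Table1 t1 : left) {
            boolean matched = false;
            for (Table2 t2 : right) {
                if (t1.t1_suppkey == t2.t2_suppkey) {
                    matched = true;
                    out.add(fn.apply(t1, t2));
                }
            }
            if (!matched) {
                out.add(fn.apply(t1, null)); // right side is null here
            }
        }
        return out;
    }

    // NOT IN: keep t1_value1 only for left records whose right side is null.
    static List<String> notIn(List<Table1> left, List<Table2> right) {
        List<String> joined = leftOuterJoin(left, right,
                (t1, t2) -> t2 == null ? t1.t1_value1 : null);
        List<String> result = new ArrayList<>();
        for (String value : joined) {
            if (value != null) {
                result.add(value);
            }
        }
        return result;
    }
}
```

A function that dereferences t2 unconditionally, such as (t1, t2) -> t2.t2_suppkey, throws a NullPointerException in this sketch as soon as one left record has no partner, which is the same failure mode as in the question.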
You can also implement the NOT IN join with a FlatJoinFunction that has a Collector argument and only emit t1 if t2 == null.
Another option is to use Flink's batch SQL support which supports the query in your example.
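// NOT IN via a FlatJoinFunction: emit t1 only when it has no match in table2.
output = table1
    .leftOuterJoin(table2)
    .where("t1_suppkey").equalTo("t2_suppkey")
    .with((Table1 t1, Table2 t2, Collector<Tuple2<Integer, Integer>> c) -> {
        if (t2 == null) {
            c.collect(new Tuple2<>(t1.t1_suppkey, t1.t1_value1));
        }
        // Otherwise t1 has a match in table2, so nothing is emitted.
    })
    // The lambda's generic output type is erased, so the type hint from the question is most likely still needed.
    .returns(new TypeHint<Tuple2<Integer, Integer>>() {});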