简体   繁体   中英

Java + Apache Spark inner join between two datesets

Small question regarding how to achieve inner join using Apache Spark and Java please.

I have this very simple piece of code.

final Dataset<Row> dataSetTableLeft = getDatasetForLeft();
final Dataset<Row> dataSetTableRight = getDatasetForRight();
final Dataset<Row> dataSetTableResult = dataSetTableLeft.join(dataSetTableRight);

dataSetTableLeft.show();
dataSetTableRight.show();
dataSetTableResult.show();

The first table dataSetTableLeft.show , looks like this, very straightforward.

+----------+-----+
|      time|label|
+----------+-----+
|1637020800|    0|
|1637107200|    0|
|1637193600|    0|
|1637280000|    0|
|1637366400|    0|
|1637452800|    0|
+----------+-----+"

The second table dataSetTableRight , looks like this, also very straightforward.

+----------+-----+
|      time|label|
+----------+-----+
|1637193600|    1|
|1637280000|    2|
|1637366400|    1|
+----------+-----+"

What I would like to achieve is a result table like this, which is a kind of inner join.

+----------+-----+
|      time|label|
+----------+-----+
|1637020800|    0|
|1637107200|    0|
|1637193600|    1|
|1637280000|    2|
|1637366400|    1|
|1637452800|    0|
+----------+-----+"

Unfortunately, I am not seeing any function like this. innerjoin()

Therefore, I am trying some kind of combinations

dataSetTableLeft.unionAll(dataSetTableRight);
dataSetTableLeft.crossJoin(dataSetTableRight);

without any luck so far.

What is the correct way to achieve the inner join?

Execute anti join first, and then union .

dataSetTableLeft.join(dataSetTableRight, on='time', how='anti').union(dataSetTableRight)

Inner join will only give you rows that exist in the two datasets. What you want here is to keep all rows from the dataSetTableLeft and retrieve label value from dataSetTableRight when time matches.

For this, simply use left join and coalesce function to get label value from the second dataset when there is a match.

I'm not using Spark-Java to test it but something like this should give you the expected result:

import static org.apache.spark.sql.functions.*;


dataSetTableLeft.join(
    dataSetTableRight,
    dataSetTableLeft.col("time").equalTo(dataSetTableRight.col("time")),
    "left_outer"
).select(
    dataSetTableLeft.col("time"),
    coalesce(dataSetTableRight.col("label"), dataSetTableLeft.col("label")).alias("label")
).show();

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM