
Merge two parquet files using Dataframe in Spark java

I have two parquet files with the same schema. I want to merge the second file into the first using a Dataframe in Spark Java, without any duplicate data. How can I do this?

Thanks in advance.

First, read your two parquet files into dataframes:

Dataset<Row> df1 = spark.read().parquet("dataset1.parquet");
Dataset<Row> df2 = spark.read().parquet("dataset2.parquet");

Then, use unionAll (Spark 1.x) or union (Spark 2.x) to append the second dataframe to the first. Since union keeps duplicate rows, finish with distinct() to remove them:

Dataset<Row> df_merged = df1.union(df2).distinct();
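To make the behavior concrete, here is a minimal self-contained sketch that builds the two dataframes in memory instead of reading parquet files (the schema, column names, and local-mode SparkSession setup are assumptions for illustration). One row appears in both dataframes, standing in for the duplicate data the question wants to avoid; union() keeps it twice, and distinct() drops the extra copy:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class MergeParquetExample {
    public static void main(String[] args) {
        // Local SparkSession for the demo; in a real job you would
        // typically read the two parquet files instead.
        SparkSession spark = SparkSession.builder()
                .appName("MergeParquetExample")
                .master("local[*]")
                .getOrCreate();

        // Both datasets share the same schema; row (2, "b") appears in both.
        StructType schema = new StructType()
                .add("id", DataTypes.IntegerType)
                .add("value", DataTypes.StringType);
        List<Row> rows1 = Arrays.asList(RowFactory.create(1, "a"), RowFactory.create(2, "b"));
        List<Row> rows2 = Arrays.asList(RowFactory.create(2, "b"), RowFactory.create(3, "c"));
        Dataset<Row> df1 = spark.createDataFrame(rows1, schema);
        Dataset<Row> df2 = spark.createDataFrame(rows2, schema);

        // union() appends rows and keeps duplicates (4 rows here);
        // distinct() then removes exact duplicate rows, leaving 3.
        Dataset<Row> merged = df1.union(df2).distinct();
        long count = merged.count();
        if (count != 3) {
            throw new AssertionError("expected 3 unique rows, got " + count);
        }
        System.out.println("merged row count: " + count);
        spark.stop();
    }
}
```

Note that distinct() removes only rows that are identical across all columns; if "duplicate" means "same key", dropDuplicates("id") on the union would be the variant to use.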

Alternatively, create the dataframes and use an equi-join (note: a join combines the columns of matching rows, which is different from appending the second file's rows to the first):

 val output = df1.join(df2,Seq("id"),joinType="Inner")
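The snippet above is Scala. Since the question asks for Java, here is a hedged sketch of the equivalent inner equi-join using the Java API's join(Dataset, String) overload, with in-memory data and an assumed shared "id" column. Only the row with a matching id appears in the result:

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class EquiJoinExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("EquiJoinExample")
                .master("local[*]")
                .getOrCreate();

        // Two dataframes sharing an "id" column (schemas are assumptions).
        StructType left = new StructType()
                .add("id", DataTypes.IntegerType)
                .add("name", DataTypes.StringType);
        StructType right = new StructType()
                .add("id", DataTypes.IntegerType)
                .add("score", DataTypes.IntegerType);
        Dataset<Row> df1 = spark.createDataFrame(
                Arrays.asList(RowFactory.create(1, "a"), RowFactory.create(2, "b")), left);
        Dataset<Row> df2 = spark.createDataFrame(
                Arrays.asList(RowFactory.create(2, 20), RowFactory.create(3, 30)), right);

        // Inner equi-join on the shared "id" column: only id = 2 matches,
        // so the result is one row with columns (id, name, score).
        Dataset<Row> output = df1.join(df2, "id");
        long count = output.count();
        if (count != 1) {
            throw new AssertionError("expected 1 matching row, got " + count);
        }
        System.out.println("joined row count: " + count);
        spark.stop();
    }
}
```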
