
Merge two parquet files using Dataframe in Spark java

I have two parquet files with the same schema. I want to merge the second file into the first using a Dataframe in Spark Java, without any duplicate data. How can I do this?

Thanks in advance.

First, read your two parquet files into dataframes:

Dataset<Row> df1 = spark.read().parquet("dataset1.parquet");
Dataset<Row> df2 = spark.read().parquet("dataset2.parquet");

Then, use unionAll (Spark 1.x) or union (Spark 2.x) to append the second dataframe to the first. Since union keeps duplicate rows, finish with distinct() to remove them:

Dataset<Row> df_merged = df1.union(df2).distinct();
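To make the behavior concrete, here is a minimal self-contained sketch that builds the two dataframes in memory instead of reading parquet files (the schema, column names, and local-mode SparkSession setup are assumptions for illustration). One row appears in both dataframes, standing in for the duplicate data the question wants to avoid; union() keeps it twice, and distinct() drops the extra copy:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class MergeParquetExample {
    public static void main(String[] args) {
        // Local SparkSession for the demo; in a real job you would
        // typically read the two parquet files instead.
        SparkSession spark = SparkSession.builder()
                .appName("MergeParquetExample")
                .master("local[*]")
                .getOrCreate();

        // Both datasets share the same schema; row (2, "b") appears in both.
        StructType schema = new StructType()
                .add("id", DataTypes.IntegerType)
                .add("value", DataTypes.StringType);
        List<Row> rows1 = Arrays.asList(RowFactory.create(1, "a"), RowFactory.create(2, "b"));
        List<Row> rows2 = Arrays.asList(RowFactory.create(2, "b"), RowFactory.create(3, "c"));
        Dataset<Row> df1 = spark.createDataFrame(rows1, schema);
        Dataset<Row> df2 = spark.createDataFrame(rows2, schema);

        // union() appends rows and keeps duplicates (4 rows here);
        // distinct() then removes exact duplicate rows, leaving 3.
        Dataset<Row> merged = df1.union(df2).distinct();
        long count = merged.count();
        if (count != 3) {
            throw new AssertionError("expected 3 unique rows, got " + count);
        }
        System.out.println("merged row count: " + count);
        spark.stop();
    }
}
```

Note that distinct() removes only rows that are identical across all columns; if "duplicate" means "same key", dropDuplicates("id") on the union would be the variant to use.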

Alternatively, create the dataframes and use an equi-join (note: a join combines the columns of matching rows, which is different from appending the second file's rows to the first):

 val output = df1.join(df2,Seq("id"),joinType="Inner")
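The snippet above is Scala. Since the question asks for Java, here is a hedged sketch of the equivalent inner equi-join using the Java API's join(Dataset, String) overload, with in-memory data and an assumed shared "id" column. Only the row with a matching id appears in the result:

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class EquiJoinExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("EquiJoinExample")
                .master("local[*]")
                .getOrCreate();

        // Two dataframes sharing an "id" column (schemas are assumptions).
        StructType left = new StructType()
                .add("id", DataTypes.IntegerType)
                .add("name", DataTypes.StringType);
        StructType right = new StructType()
                .add("id", DataTypes.IntegerType)
                .add("score", DataTypes.IntegerType);
        Dataset<Row> df1 = spark.createDataFrame(
                Arrays.asList(RowFactory.create(1, "a"), RowFactory.create(2, "b")), left);
        Dataset<Row> df2 = spark.createDataFrame(
                Arrays.asList(RowFactory.create(2, 20), RowFactory.create(3, 30)), right);

        // Inner equi-join on the shared "id" column: only id = 2 matches,
        // so the result is one row with columns (id, name, score).
        Dataset<Row> output = df1.join(df2, "id");
        long count = output.count();
        if (count != 1) {
            throw new AssertionError("expected 1 matching row, got " + count);
        }
        System.out.println("joined row count: " + count);
        spark.stop();
    }
}
```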
