
Spark Scala - Read different parquet files with different schemas and write to different output paths

I have a parent folder with child folders, and each child folder contains a parquet file (representing a table), like this:

|Parent_input_folder:
|--- Children_folder1:
|      |--- file1.parquet
|--- Children_folder2:
|      |--- file2.parquet

The goal is to read from these folders and, after some transformations, write to the corresponding output folders with Spark Scala:

|Parent_output_folder:
|--- Children_folder1:
|      |--- file1.parquet
|--- Children_folder2:
|      |--- file2.parquet

Note: each file has a different schema from the others.

Do you have any idea how to do this in Spark Scala?
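Since every table has its own schema, one straightforward option is to treat each child folder as its own DataFrame and mirror the folder name on the output side. Below is a minimal sketch of that approach; the root paths and the transform function are placeholders you would replace with your own:

val spark = org.apache.spark.sql.SparkSession.builder().appName("per-table-copy").getOrCreate()

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame

// Hypothetical root paths; replace with your actual locations.
val inputRoot  = "Parent_input_folder"
val outputRoot = "Parent_output_folder"

// Placeholder for whatever transformation each table needs.
def transform(df: DataFrame): DataFrame = df

val rootPath = new Path(inputRoot)
val fs = rootPath.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Process each child folder independently, so every table keeps its own schema.
fs.listStatus(rootPath)
  .filter(_.isDirectory)
  .map(_.getPath.getName)
  .foreach { child =>
    val df = spark.read.parquet(s"$inputRoot/$child")
    transform(df)
      .write
      .mode("overwrite")
      .parquet(s"$outputRoot/$child")
  }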

A way to do this that will get you almost what you want is to combine input_file_name with partitionBy, as below:

import org.apache.spark.sql.functions._

val results = table
  .withColumn("path", input_file_name())                                    // full URI of the source file
  .withColumn("path", concat_ws("\\", slice(split(col("path"), "/"), 8, 2))) // keep the segments you want; the start index (8 here) depends on how deep your input folder sits in the URI

results
  .write
  .partitionBy("path") // partition by your path column
  .parquet("structured")

Good luck!
