
Spark Scala - Read different parquet files with different schemas and write to different output paths

I have a parent folder with child folders, and each child folder contains a parquet file (representing a table), like this:

|Parent_input_folder:
|--- Children_folder1:
|      |--- file1.parquet
|--- Children_folder2:
|      |--- file2.parquet

The goal is to read from these folders and, after some transformations, write to the corresponding output folders with Spark Scala:

|Parent_output_folder:
|--- Children_folder1:
|      |--- file1.parquet
|--- Children_folder2:
|      |--- file2.parquet

Note: each file has a different schema from the others.

Do you have any idea how to do this in Spark Scala?
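Since every table has its own schema, one straightforward option is to treat each child folder as its own DataFrame and mirror the folder name on the output side. Below is a minimal sketch of that approach; the root paths and the transform function are placeholders you would replace with your own:

val spark = org.apache.spark.sql.SparkSession.builder().appName("per-table-copy").getOrCreate()

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame

// Hypothetical root paths; replace with your actual locations.
val inputRoot  = "Parent_input_folder"
val outputRoot = "Parent_output_folder"

// Placeholder for whatever transformation each table needs.
def transform(df: DataFrame): DataFrame = df

val rootPath = new Path(inputRoot)
val fs = rootPath.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Process each child folder independently, so every table keeps its own schema.
fs.listStatus(rootPath)
  .filter(_.isDirectory)
  .map(_.getPath.getName)
  .foreach { child =>
    val df = spark.read.parquet(s"$inputRoot/$child")
    transform(df)
      .write
      .mode("overwrite")
      .parquet(s"$outputRoot/$child")
  }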

A way to do this that will get you almost what you want is to combine input_file_name with partitionBy, as below:

import org.apache.spark.sql.functions._

val results = table
  .withColumn("path", input_file_name())                                    // full URI of the source file
  .withColumn("path", concat_ws("\\", slice(split(col("path"), "/"), 8, 2))) // keep the segments you want; the start index (8 here) depends on how deep your input folder sits in the URI

results
  .write
  .partitionBy("path") // partition by your path column
  .parquet("structured")

Good luck!
