I have a parent folder with child folders; each child folder contains a Parquet file (representing one table), like this:
|Parent_input_folder:
|--- Children_folder1:
|    |--- file1.parquet
|--- Children_folder2:
|    |--- file2.parquet
The goal is to read from these folders, apply transformations with Spark Scala, and write the results to the corresponding output folders:
|Parent_output_folder:
|--- Children_folder1:
|    |--- file1.parquet
|--- Children_folder2:
|    |--- file2.parquet
Note: each file has a different schema from the others.
Does anyone have an idea of how to do this in Spark Scala?
A way to do this that will get you almost what you want is the combination of input_file_name and partitionBy, as below:
import org.apache.spark.sql.functions._

val results = table
  .withColumn("path", input_file_name())
  // keep only the part of the full file path you want as the output
  // sub-folder; the slice arguments (start = 8, length = 2) depend on
  // the depth of your input path, so adjust them for your layout
  .withColumn("path", concat_ws("\\", slice(split(col("path"), "/"), 8, 2)))

results
  .write
  .partitionBy("path") // one output sub-folder per distinct path value
  .parquet("structured")
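One caveat: partitionBy names each output directory path=<value>, and reading all files into a single DataFrame assumes they share one schema, which conflicts with your note that every file has its own schema. A minimal alternative sketch, assuming your folder layout above (inputRoot, outputRoot, and the transform function are placeholders I introduced, not from the original post), is to list the child folders with the Hadoop FileSystem API and process each table independently:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("per-folder-etl").getOrCreate()

val inputRoot = "Parent_input_folder"   // placeholder: root of the input tree
val outputRoot = "Parent_output_folder" // placeholder: root of the output tree

// placeholder: apply your actual transformations here
def transform(df: DataFrame): DataFrame = df

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// list the child folders, read each Parquet table with its own schema,
// transform it, and write it to the matching output sub-folder
fs.listStatus(new Path(inputRoot))
  .filter(_.isDirectory)
  .map(_.getPath)
  .foreach { childDir =>
    val df = spark.read.parquet(childDir.toString)
    transform(df)
      .write
      .mode("overwrite")
      .parquet(s"$outputRoot/${childDir.getName}")
  }

Because each folder is read and written separately, the per-file schemas never have to be merged, and the output mirrors the input layout exactly.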
Good luck!