/SRC1/trialbucket=1/1.parquet
/SRC1/trialbucket=2/2.parquet
/SRC2/trialbucket=1/3.parquet
/SRC2/trialbucket=2/4.parquet
All of the parquet files in the folders above have the same schema,
e.g. Col1,Col2,Col3
I have to load all the files into a Delta table with the following schema:
Col1,Col2,Col3,Source
data1,data2,data3,SRC1
data11,data22,data33,SRC1
data1111,data222,data333,SRC1
data5,data6,data7,SRC2
data55,data66,data77,SRC2
data555,data666,data777,SRC2
I can read each folder separately and add the folder name as the last column (with .withColumn), but I have to go through 10000 such folders to read all the parquet files and load them into a table, which takes a lot of time!
Is there any other way without the for loop to get the folder name and add it to the column?
You can use regexp_extract to get the root folder name from input_file_name:
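// Capture the folder two levels above the file; a greedy "/(.*)/.*" would also capture "trialbucket=1".
val df1 = df.withColumn("Source", regexp_extract(input_file_name(), "/([^/]+)/[^/]+/[^/]+$", 1))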
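To avoid the loop entirely, the approach above can be combined with a single glob read. The following is only a sketch under some assumptions: the glob "/SRC*/trialbucket=*/*.parquet" is inferred from the layout in the question, the table name all_sources is hypothetical, and Delta Lake is assumed to be available on the cluster. The pattern captures the folder two levels above each file, so it works whether input_file_name returns a bare path or one prefixed with a scheme such as file: or dbfs:.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{input_file_name, regexp_extract}

object LoadWithSource {
  // Matches the last three path segments and captures the first of them:
  // .../SRC1/trialbucket=1/1.parquet -> SRC1
  val SourcePattern = "/([^/]+)/[^/]+/[^/]+$"

  // Adds a Source column derived from the path of the file each row came from.
  def withSource(df: DataFrame): DataFrame =
    df.withColumn("Source", regexp_extract(input_file_name(), SourcePattern, 1))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("load-with-source").getOrCreate()

    // One read over every source folder instead of one read per folder.
    // The glob is an assumption based on the folder layout in the question.
    val df = spark.read.parquet("/SRC*/trialbucket=*/*.parquet")

    withSource(df)
      .write
      .format("delta")
      .mode("append")
      .saveAsTable("all_sources") // hypothetical table name
  }
}
```

Because the glob points at the files themselves, Spark should not add trialbucket as a partition column, which matches the target schema Col1,Col2,Col3,Source. If you do want that column, reading the parent folders with a basePath option would be the place to start.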