Load folder name as column in delta table

/SRC1/trialbucket=1/1.parquet
/SRC1/trialbucket=2/2.parquet
/SRC2/trialbucket=1/3.parquet
/SRC2/trialbucket=2/4.parquet

All the above parquet files in the folders have the same schema.

eg. Col1,Col2,Col3

I have to load all the files into a delta table with the following schema

Col1,Col2,Col3,Source

data1,data2,data3,SRC1
data11,data22,data33,SRC1
data1111,data222,data333,SRC1
data5,data6,data7,SRC2
data55,data66,data77,SRC2
data555,data666,data777,SRC2

I can do this with a for loop over each folder, adding the folder name as the last column (.withColumn), but I have to go through 10,000 such folders to read all the parquet files and load them into a table, which takes a lot of time!

Is there any way to get the folder name and add it as a column without the for loop?

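You can use regexp_extract to get the root folder name from input_file_name(), which returns the path of the file each row was read from. Note that a greedy group such as "/(.*)/.*" captures everything up to the last slash (for example SRC1/trialbucket=1), so the pattern below captures only the directory two levels above the file:

import org.apache.spark.sql.functions.{input_file_name, regexp_extract}
val df1 = df.withColumn("Source", regexp_extract(input_file_name(), "/([^/]+)/[^/]+/[^/]+$", 1))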
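
Before running this over 10,000 folders, it may help to check what the pattern captures on a few sample paths. The following is a minimal sketch in plain Scala (no Spark required). It assumes the layout from the question, <source>/<partition>/<file>, and uses a pattern anchored to the end of the path rather than a greedy (.*), which would span every directory between the first and last slash. The object name SourceFolderCheck, the helper sourceOf, and the dbfs: URI are hypothetical and only serve the illustration.

```scala
import scala.util.matching.Regex

object SourceFolderCheck {
  // Pattern intended for regexp_extract: capture the directory two levels
  // above the file (assumed layout: <source>/<partition>/<file>).
  val sourcePattern: Regex = "/([^/]+)/[^/]+/[^/]+$".r

  // Returns the captured folder name, or "" when nothing matches
  // (regexp_extract likewise yields an empty string on no match).
  def sourceOf(path: String): String =
    sourcePattern.findFirstMatchIn(path).map(_.group(1)).getOrElse("")

  def main(args: Array[String]): Unit = {
    val paths = Seq(
      "/SRC1/trialbucket=1/1.parquet",
      "/SRC2/trialbucket=2/4.parquet",
      "dbfs:/mnt/data/SRC2/trialbucket=1/3.parquet" // hypothetical full URI
    )
    paths.foreach(p => println(s"$p -> ${sourceOf(p)}"))
  }
}
```

Because the pattern is anchored at the end of the path, it should behave the same whether input_file_name() returns a bare path or a full URI with a scheme and mount prefix.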
