
Load folder name as column in delta table

/SRC1/trialbucket=1/1.parquet
/SRC1/trialbucket=2/2.parquet
/SRC2/trialbucket=1/3.parquet
/SRC2/trialbucket=2/4.parquet

All of the parquet files in the folders above have the same schema.

e.g. Col1,Col2,Col3

I have to load all the files into a delta table with the following schema:

Col1,Col2,Col3,Source

data1,data2,data3,SRC1
data11,data22,data33,SRC1
data1111,data222,data333,SRC1
data5,data6,data7,SRC2
data55,data66,data77,SRC2
data555,data666,data777,SRC2

I can do this with a for loop over the folders, adding the folder name as the last column ( .withColumn ) each time, but there are 10000 such folders, so reading all the parquet files and loading them into a table this way takes a lot of time!

Is there any way to get the folder name and add it as a column without the for loop?

You can use regexp_extract to get the root folder name from the input_file_name :

import org.apache.spark.sql.functions.{input_file_name, regexp_extract}
val df1 = df.withColumn("Source", regexp_extract(input_file_name(), "/([^/]+)/trialbucket=[^/]+/[^/]+$", 1))
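
For context, here is a minimal end-to-end sketch of how this could look without the loop: read every folder in one pass through a glob, derive Source from each row's file path, and append to the delta table. The glob root `/*/trialbucket=*/*.parquet`, the table name `target_table`, and the object and value names are assumptions for illustration, and the write step assumes Delta Lake is available on the cluster. The pattern captures the directory directly above `trialbucket=...`, which matches the layout in the question; it would need adjusting for a different layout.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{input_file_name, regexp_extract}

object LoadWithSource {
  // Capture the directory directly above "trialbucket=<n>/<file>".
  // input_file_name() returns a full URI (e.g. with a "dbfs:" or "file:" prefix),
  // so the pattern is anchored at the end of the path, not the start.
  val SourcePattern = "/([^/]+)/trialbucket=[^/]+/[^/]+$"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("load-with-source").getOrCreate()

    // Hypothetical glob: adjust the root to wherever SRC1, SRC2, ... live.
    val inputGlob = "/*/trialbucket=*/*.parquet"

    // One read over all folders; Spark parallelises the file listing and scan.
    val df = spark.read.parquet(inputGlob)

    // Derive Source per row from the file the row was read from,
    // then keep only the columns of the target schema.
    val withSource = df
      .withColumn("Source", regexp_extract(input_file_name(), SourcePattern, 1))
      .select("Col1", "Col2", "Col3", "Source")

    // Hypothetical table name; requires Delta Lake on the classpath.
    withSource.write.format("delta").mode("append").saveAsTable("target_table")

    spark.stop()
  }
}
```

Because the folder name is computed per row from the file path, no per-folder loop or union is needed, and the single read lets Spark schedule all the files as one job.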

