
Delta table: COPY INTO only specific partitioned folders from an S3 bucket

I have a landing S3 bucket which receives some data each day, partitioned by date:

s3://my_bucket/date=2020-01-01/my_data.txt
s3://my_bucket/date=2020-01-02/my_data.txt
s3://my_bucket/date=2020-01-03/my_data.txt

I created a first test Delta table partitioned on the date column using SQL, then ran:

COPY INTO delta.my_table FROM (SELECT date, value FROM 's3://my_bucket/') FILEFORMAT = TEXT

Everything was working well. Then today a new partition folder appeared in that bucket:

s3://my_bucket/date=2020-01-04/my_data.txt

I have no idea how to COPY INTO only that new folder while keeping the date partitioning value. Basically, I can't re-use the same SQL statement, because it would copy the previous files (other dates that are already inside the Delta table).

I tried specifying the folder path directly, like this:

COPY INTO delta.my_table FROM (SELECT date, value FROM 's3://my_bucket/date=2020-01-04') FILEFORMAT = TEXT

but Delta/Spark is not able to resolve the date column, because it reads the files directly rather than the subfolder whose name carries the date information:

SQL AnalysisException: cannot resolve 'date' given input columns: [value];

I wanted to add some PATTERN matching on the subfolder, like:

    COPY INTO delta.my_table FROM (SELECT date, value FROM 's3://my_bucket/') FILEFORMAT = TEXT
    PATTERN = 'date=2020-01-04/*'

but I get the same error; Spark is not able to understand that this folder is a new partition.

My question is quite simple: is it possible to use COPY INTO with an incrementally partitioned S3 bucket (like in my example)?

PS: My Delta table is an external table that points to another S3 bucket. s3://my_bucket/ is supposed to be only a landing zone and should be immutable.

COPY INTO is an idempotent operation: it won't load data from files that were already processed, unless you explicitly ask it to with COPY_OPTIONS ('force' = 'true') (see the docs).
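In your case that means the original statement can simply be re-run unchanged, and only the new date=2020-01-04 files will be loaded. A minimal sketch, reusing the question's table and path:

    -- Re-running the original statement only loads files not yet processed,
    -- so only date=2020-01-04 is picked up here.
    COPY INTO delta.my_table FROM (SELECT date, value FROM 's3://my_bucket/') FILEFORMAT = TEXT

    -- Only if you deliberately want to reload files that were already processed:
    COPY INTO delta.my_table FROM (SELECT date, value FROM 's3://my_bucket/') FILEFORMAT = TEXT
    COPY_OPTIONS ('force' = 'true')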

You can easily prove this with a small experiment; I'm using DBFS just as an example:

  • Load a file with content 1 into dbfs:/tmp/cp-test/date=2022-05-16/1.txt and import it, as sketched below:

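Roughly what that step looks like in SQL (a sketch, not the literal screenshot; the cp_test table name and its DDL are my assumptions):

    -- Hypothetical target table; the screenshot doesn't show the exact DDL.
    CREATE TABLE IF NOT EXISTS cp_test (date STRING, value STRING) USING DELTA;

    -- dbfs:/tmp/cp-test/date=2022-05-16/1.txt contains "1";
    -- the date column comes from the partition folder name.
    COPY INTO cp_test
    FROM (SELECT date, value FROM 'dbfs:/tmp/cp-test/')
    FILEFORMAT = TEXT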

  • Check that the data was loaded, e.g. with the query below:

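The check is a plain query against the (hypothetical) cp_test table; at this point it returns a single row:

    SELECT * FROM cp_test;
    -- date        value
    -- 2022-05-16  1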

  • Create a file with content 2, upload it to dbfs:/tmp/cp-test/date=2022-05-17/1.txt, and import it with the same statement; as shown below, only one row is imported, even though the old file is still there:

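The key point is that the COPY INTO statement is exactly the same as before; its result metrics show that only the new file was loaded (the output shown is illustrative, not the literal screenshot):

    -- dbfs:/tmp/cp-test/date=2022-05-17/1.txt now exists with content "2".
    COPY INTO cp_test
    FROM (SELECT date, value FROM 'dbfs:/tmp/cp-test/')
    FILEFORMAT = TEXT
    -- illustrative result: num_inserted_rows = 1 (the 2022-05-16 file is skipped)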

  • And if we look at the data, we see only the 2 rows we expect, as in the query below:

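The final query confirms that the re-run produced no duplicates:

    SELECT * FROM cp_test ORDER BY date;
    -- date        value
    -- 2022-05-16  1
    -- 2022-05-17  2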
