
Loading files through a Spark DataFrame into a Delta table

I have a folder VAS in Azure ADLS Gen2. In this folder I have the files Vas_1.csv, Vas_2.csv, and Vas_3.csv.

I have to load these files using a PySpark DataFrame into a Delta table VAS, with two additional columns derived at runtime: load_date and file_name.

Once the data is loaded into the VAS table, the next day a few more files, VAS_4.csv and VAS_5.csv, arrive in the ADLS Gen2 folder, and now I have to load these two files into the same VAS table. Note: the VAS folder in ADLS Gen2 now has 5 files, so on the second load I have to skip the previously loaded files.

Use streaming mode with a checkpoint directory, for example:

from pyspark.sql.functions import current_date, input_file_name

(spark
    .readStream
    .format("csv")
    .option("header", "true")                    # assuming the CSVs have a header row
    .schema(csv_schema)                          # streaming CSV sources need an explicit schema (your StructType)
    .load("your_folder/*")
    .withColumn("load_date", current_date())     # the two runtime-derived columns
    .withColumn("file_name", input_file_name())
    .writeStream
    .format("delta")
    .option("checkpointLocation", "some_other_path_used_only_for_this_purpose")
    .trigger(once=True)
    .start("your_destination_folder"))

This "stream" will run once and then terminate, so you can just run it on demand (or on a schedule?). When it does so, it will check the checkpoint location for the list of files it has already processed, and skip over those. Once it is done, it will update the checkpoint directory with the new files it processed.
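The bookkeeping that the checkpoint directory performs can be illustrated in plain Python. This is a minimal sketch of the idea, not Spark itself: a JSON checkpoint file records which CSVs were already loaded, and each run picks up only the new ones, tagging every row with load_date and file_name (the function and file names here are hypothetical):

```python
import csv
import json
import os
from datetime import date

def incremental_load(folder, checkpoint_path, table):
    """Load only CSV files not yet recorded in the checkpoint,
    appending load_date and file_name columns to every row."""
    processed = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            processed = set(json.load(f))

    # Only files that are not in the checkpoint get loaded.
    new_files = sorted(f for f in os.listdir(folder)
                       if f.endswith(".csv") and f not in processed)

    for name in new_files:
        with open(os.path.join(folder, name), newline="") as f:
            for row in csv.DictReader(f):
                row["load_date"] = date.today().isoformat()  # derived at runtime
                row["file_name"] = name                      # source file of the row
                table.append(row)

    # Update the checkpoint so the next run skips these files.
    with open(checkpoint_path, "w") as f:
        json.dump(sorted(processed | set(new_files)), f)
    return new_files
```

Running it once over Vas_1..3.csv loads three files; dropping VAS_4.csv and VAS_5.csv into the folder and running it again loads only those two, which is exactly the behavior the checkpointed stream gives you.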

As an alternative you can use COPY INTO. COPY INTO skips already-loaded files and copies only the remaining files from the source into the Delta table.

But for this we first need to create a Delta table with the schema. This is my sample Delta table.

%sql
CREATE TABLE IF NOT EXISTS mytable5
(id LONG, first_name STRING, last_name STRING)
USING DELTA;

These are my first 2 JSON files in the data container; each has only 5 records matching the above schema.

[screenshot: the two JSON files in the data container]

Then use COPY INTO like this. I have used an Azure SAS token for ADLS Gen2.

%sql
COPY INTO mytable5
FROM 'abfss://<container>@<storageaccount>.dfs.core.windows.net/folder'
WITH (CREDENTIAL (AZURE_SAS_TOKEN = '<SAS token>'))
FILEFORMAT = JSON
PATTERN = '*.json'

First, copying the 2 JSON files.

[screenshot: first COPY INTO run, 10 records inserted]

You can see 10 records are inserted. Then I upload a new JSON file with 5 records and execute the code again.

[screenshot: new JSON file with 5 records uploaded to the container]

Execution:

[screenshot: second COPY INTO run, 5 records inserted]

Now you can see only 5 records are inserted and the previous files were skipped.

So, first create the Delta table, and then you can schedule the above code in a notebook as per your requirement.
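One more note on this approach: COPY INTO is idempotent by default, so rerunning the same statement over the same folder will not insert duplicate rows. If you ever need to reload files that were already copied (for example, after a bad load), Databricks exposes a copy option for that. A sketch reusing the table and path from above, with the `force` option as documented by Databricks:

```sql
%sql
COPY INTO mytable5
FROM 'abfss://<container>@<storageaccount>.dfs.core.windows.net/folder'
WITH (CREDENTIAL (AZURE_SAS_TOKEN = '<SAS token>'))
FILEFORMAT = JSON
PATTERN = '*.json'
COPY_OPTIONS ('force' = 'true')  -- reload files even if they were loaded before
```

Leave `COPY_OPTIONS` out entirely for the normal incremental behavior described in this answer.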
