I have an external stage organized as follows:
s3://finance/credits
    /Week_2022_0601_0607
        file01.json
        file02.json
    /Week_2022_0608_0615
        file01.json
        file02.json
        file03.json
etc... New folders will get added each week
Can I define my storage_location property for my external stage as:
"s3://finance/credits/./*.json"
so that in my COPY INTO... code, Snowflake will automatically traverse the nested, date-named folders and load all the files? Since new folders will be added each week, I cannot hard-code each folder in the storage_location path of the stage.
This really applies to any path - COPY INTO with or without using a Stage.
In the Snowflake Citibike lab, you create a stage like:
create stage citibike.public.citibike_trips
url = 's3://snowflake-workshop-lab/citibike-trips';
a file format like:
create file format citibike.public.csv type = csv
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ('\\N', '');
then load the files like:
copy into trips
from @citibike_trips
file_format = csv
PATTERN= '.*trips_.*csv.gz';
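In any case, an S3 object name is not a path; it is just a string that looks like one. So when you match on a prefix or a pattern, ALL objects whose names match are returned, no matter how deeply nested they appear to be. Wildcards do not go in the stage URL itself; point the stage at the common prefix and let PATTERN do the matching.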
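Applied to the layout in the question, this could look roughly like the sketch below. The stage name `finance_credits_stage` and the table `credits_raw` are made up for illustration, and a private bucket would also need credentials or a storage integration on the stage, which is left out here.

```sql
-- Hypothetical names: finance_credits_stage, credits_raw.
-- The stage points at the common prefix only; no wildcards in the URL.
create stage finance_credits_stage
  url = 's3://finance/credits/';

-- Assumed target: one VARIANT column holding each JSON document.
create table credits_raw (v variant);

-- PATTERN is a regular expression matched against the object names,
-- so it picks up the .json files under every Week_... folder,
-- including folders that are added later.
copy into credits_raw
  from @finance_credits_stage
  file_format = (type = json)
  pattern = '.*Week_.*[.]json';
```

Because the match runs over plain object-name strings, nothing needs to change when a new weekly folder appears.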
Keep this in mind as your set of files builds up: you can end up with millions of files in S3, and the full list of matching names is transferred to Snowflake on each operation.
Snowflake also keeps load metadata for the files it has loaded and does not reload a file that has not changed. For COPY INTO this metadata is retained for 64 days (Snowpipe retains 14 days); older files whose load status is no longer known are skipped by default.
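If you want to check what Snowflake has recorded for recent loads, something like the following should work. It is only a sketch: it assumes the `trips` table from the lab and that the current database and schema are the ones containing it.

```sql
-- Recent load activity for the lab's trips table (last 7 days).
select file_name, last_load_time, status, row_count
from table(information_schema.copy_history(
       table_name => 'TRIPS',
       start_time => dateadd(day, -7, current_timestamp())))
order by last_load_time desc;

-- To deliberately reload files that were already loaded, COPY INTO
-- accepts FORCE = TRUE; note that this can create duplicate rows.
copy into trips
  from @citibike_trips
  file_format = csv
  pattern = '.*trips_.*csv.gz'
  force = true;
```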
The standard advice is to track a high-water mark and to make your folder/path hierarchical, e.g. year/month/week or year/month/day, so that you can use progressively narrower path filters to reduce the size of the LIST that gets transferred.
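As a rough illustration of a narrower path filter, assume the bucket were reorganized as `s3://finance/credits/2022/06/Week_0608_0615/file01.json` (a hypothetical layout, not the one in the question) and reuse the made-up names `finance_credits_stage` and `credits_raw`:

```sql
-- Hypothetical layout: <stage>/<year>/<month>/<week folder>/<file>.json
-- Only object names starting with the 2022/06/ prefix are listed.
list @finance_credits_stage/2022/06/;

-- Same idea for the load: the path after the stage name is a prefix
-- filter, and PATTERN then selects the .json files under it.
copy into credits_raw
  from @finance_credits_stage/2022/06/
  file_format = (type = json)
  pattern = '.*[.]json';
```

The high-water mark (for example, the last month or week already loaded) decides which prefix to use on the next run.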