
Using Snowpipe - What's the best practice for loading small files, e.g. thousands of 4 KB files per day?

Questions

  1. How much more expensive is it to load small files (e.g. 4 KB) using Snowpipe than, say, 16 KB, 500 KB or 1-10 MB (the recommended file size)? Note: this question implies it is more expensive to load small files than files in the recommended 1-10 MB range.

  2. I understand best practice is to load files sized 1-10 MB, but I need near real-time delivery (a few minutes). I could concatenate files to make them larger, but I can't wait more than 60 seconds before sending the micro-batch to S3 and therefore Snowpipe. I currently write whatever I have every 30 seconds, but I see Snowpipe reporting every 60 seconds. Does this mean there is no point writing files to S3 more frequently than every 60 seconds? i.e. if I send a file every 30 seconds, will it actually reduce average latency, or is 60 seconds the minimum Snowpipe cycle?

  3. Loading 4 KB files (around 200 MB a day at 4 KB per file) is costing around 20 credits per gigabyte, which is very expensive. What kind of cost per gigabyte should I expect with Snowpipe if I load (for example) CSV files in the 1-10 MB range? Will my cost per gigabyte drop if I keep within the 1-10 MB range?

  4. Is there any faster/cheaper alternative to get data into Snowflake? Note: I'm currently using Snowpipe to load Parquet into a VARIANT column, then using STREAMS and TASKS to restructure the data for near real-time analysis (a rough sketch of this pattern follows this list). I understand it's cheaper to use Snowpipe than a virtual warehouse. Is this true? I suspect the real answer is "it depends", but depends upon what?

  5. In addition to my near real-time requirement, I have a number of systems delivering batch feeds (CSV format, approx. once every 4 hours, with latency expected within 30 minutes to process and present for analysis). File sizes vary here, but most are in the 1 MB to 1 GB range. Should I use the same Snowpipe solution, or am I better off orchestrating the work from Airflow and using a COPY command followed by SQL statements on a dedicated virtual warehouse? Or indeed, what alternative would you recommend?

  6. I can see that loading 4 KB files with Snowpipe is expensive, and that larger files are probably cheaper per gigabyte. If I load files over 10 MB in size, will these start to become more expensive again? i.e. is the cost a "bell curve", or does it flatten out?
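
For reference, the ingestion pattern described in question 4 looks roughly like the sketch below. This is a minimal illustration only: the stage (events_stage), warehouse (transform_wh), table names and attribute names are all hypothetical, and auto-ingest assumes S3 event notifications are already configured for the stage's bucket.

    -- Landing table: each Parquet row is loaded as a single VARIANT value.
    CREATE OR REPLACE TABLE raw_events (v VARIANT);

    -- Snowpipe with auto-ingest (requires S3 event notifications on the bucket).
    CREATE OR REPLACE PIPE events_pipe AUTO_INGEST = TRUE AS
      COPY INTO raw_events
      FROM @events_stage
      FILE_FORMAT = (TYPE = 'PARQUET');

    -- A stream tracks newly loaded rows; a task drains it on a schedule
    -- (1 minute is the shortest schedule a task supports).
    CREATE OR REPLACE STREAM raw_events_stream ON TABLE raw_events;

    CREATE OR REPLACE TASK restructure_events
      WAREHOUSE = transform_wh
      SCHEDULE  = '1 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('RAW_EVENTS_STREAM')
    AS
      INSERT INTO events (event_id, event_ts, payload)
      SELECT v:event_id::STRING, v:event_ts::TIMESTAMP_NTZ, v:payload
      FROM raw_events_stream;

    ALTER TASK restructure_events RESUME;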

Background

  1. I'm using Snowpipe to deliver a near real-time (NRT) data load solution.
  2. I have data being replicated from Kafka to S3 around every 30 seconds from approx 30 tables, and it's being automatically loaded to Snowflake using Snowpipe.
  3. Data passed to me in Parqet format, loaded to Variant and then a view to extract out the attributes to a table before using Tasks and SQL to restructure for analysis.
  4. In a single day, I found 50,000 files loaded, file size varies but average file size is 4K per file.
  5. I can see around 30 files per minute being loaded (ie. around 100K per minute loaded).
  6. I'm trying to balance several non-functional requirements. a) Efficient use of credits. Aware small files are expensive. b) Reduce latency (I'm trying to get a pipeline of around 2-5 minutes maximum from Kafka to dashboard). c) Simplicity - IE. It needs to be easy to understand and maintain, as I expect the solution to grow MASSIVELY - IE. From around 20 tables to many hundreds of tables - all needing Near real-time
  7. I will (in the next 3 months) have a number of CSV batch loads every 4 hours. They are entirely independent data sources (from the NRT), and have much more intensive processing and ELT. I'm wondering whether I should use Snowpipe or COPY for these.
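
As a rough illustration of point 3 above, the extraction view over the VARIANT landing table might look like the following; the attribute names are purely illustrative and assume the hypothetical raw_events table from the earlier sketch.

    -- Flatten the VARIANT column into typed columns for downstream Tasks and SQL.
    CREATE OR REPLACE VIEW raw_events_v AS
    SELECT
      v:event_id::STRING        AS event_id,
      v:event_ts::TIMESTAMP_NTZ AS event_ts,
      v:customer_id::NUMBER     AS customer_id,
      v:payload                 AS payload
    FROM raw_events;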

Answer

  1. Snowpipe is serverless and billed by usage. The serverless approach has a lot less overhead than spinning up a warehouse, but some overhead remains. Therefore, the more often you send data, the more it will cost. How much? Try it out; nobody can tell you that.
  2. I am not an expert here, but Snowflake is not built for real-time workloads, whatever the marketing might tell you. In the worst case you need to expect a couple of minutes until your data is fully refreshed. Snowflake is good at handling huge data loads where you can afford to wait a bit longer.
  3. Again, try it out. One indicator is how much your data ingestion keeps a warehouse busy: if the warehouse runs for 1 minute but your load finishes in 1 second, you might get a 60-fold cost reduction.
  4. The cheapest way should be Snowpipe for your use case, assuming you are not fully occupying a warehouse.
  5. COPY INTO should be fine (a minimal sketch follows this list).
  6. I don't know. :) Try it out. I guess it doesn't make a big difference. You might run into problems with very large files (1 GB+).
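
For point 5, the warehouse-driven alternative (e.g. orchestrated from Airflow) might look roughly like the sketch below; the stage, file format and target table names are hypothetical, and the COPY options shown are just sensible defaults to adapt.

    -- One-off definitions for the 4-hourly CSV feed (names are illustrative).
    CREATE OR REPLACE FILE FORMAT batch_csv_ff
      TYPE = 'CSV'
      SKIP_HEADER = 1
      FIELD_OPTIONALLY_ENCLOSED_BY = '"';

    -- Run from the orchestrator every 4 hours on a dedicated warehouse.
    COPY INTO batch_orders
      FROM @batch_stage/orders/
      PATTERN = '.*[.]csv'
      FILE_FORMAT = (FORMAT_NAME = 'batch_csv_ff')
      ON_ERROR = 'ABORT_STATEMENT';

Unlike Snowpipe, a COPY run like this is synchronous, so the orchestrator knows exactly when the load finished and can trigger the follow-on SQL immediately, which tends to suit large, infrequent batches.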
