Using Snowpipe - What's the best practice for loading small files, e.g. thousands of 4K files per day?
Questions
How much more expensive is it to load small files (e.g. 4K) using Snowpipe than, say, 16K, 500K, or 1-10Mb (the recommended file size)? Note: this question implies it is more expensive to load small files than the recommended 1-10Mb.
I understand best practice is to load files sized 1-10Mb, but I need near real-time delivery (a few minutes). I could concatenate files to make them larger, but I can't wait more than 60 seconds before sending the micro-batch to S3 and therefore to Snowpipe. I currently write whatever I have every 30 seconds, but I see Snowpipe reports every 60 seconds. Does this mean there is no point writing files to S3 more frequently than every 60 seconds? I.e. if I send a file every 30 seconds, will it actually reduce average latency, or is 60 seconds the minimum Snowpipe cycle?
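For context, the Snowpipe setup I'm describing is an auto-ingest pipe along these lines (a sketch only; the stage, pipe, and table names are hypothetical):

```sql
-- Hypothetical setup: an external stage over the S3 bucket receiving the
-- 30-second micro-batches, plus an auto-ingest pipe that loads them.
CREATE STAGE micro_batch_stage
  URL = 's3://my-bucket/micro-batches/'
  STORAGE_INTEGRATION = my_s3_integration;

CREATE PIPE micro_batch_pipe
  AUTO_INGEST = TRUE  -- S3 event notifications trigger each load
AS
  COPY INTO raw_events (v)
  FROM (SELECT $1 FROM @micro_batch_stage)
  FILE_FORMAT = (TYPE = 'PARQUET');
```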
Loading 4K files (around 200Mb a day at 4K per file), it's costing around 20 credits per gigabyte, which is very expensive. What kind of cost per gigabyte should I expect from Snowpipe if I load, for example, CSV files in the 1-10Mb range? Will my cost per gigabyte drop if I keep within the 1-10Mb range?
Is there any faster/cheaper alternative to get data into Snowflake? Note: I'm currently using Snowpipe to load Parquet files into a VARIANT column, then using STREAMS and TASKS to restructure the data for near real-time analysis. I understand it's cheaper to use Snowpipe rather than a virtual warehouse. Is this true? I suspect the real answer is "it depends", but depends upon what?
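The STREAMS and TASKS stage of my pipeline looks roughly like this (a simplified sketch; the table, warehouse, and column names are placeholders for my actual schema):

```sql
-- Landing table: one VARIANT column per Parquet record, loaded by Snowpipe.
CREATE TABLE raw_events (v VARIANT);

-- Stream captures rows newly landed by Snowpipe.
CREATE STREAM raw_events_stream ON TABLE raw_events;

-- Task flattens new VARIANT rows into a typed table, only when
-- the stream actually has data.
CREATE TASK restructure_events
  WAREHOUSE = transform_wh
  SCHEDULE = '1 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('RAW_EVENTS_STREAM')
AS
  INSERT INTO events (event_id, event_ts, payload)
  SELECT v:event_id::NUMBER, v:event_ts::TIMESTAMP_NTZ, v:payload
  FROM raw_events_stream;

ALTER TASK restructure_events RESUME;
```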
In addition to my near real-time requirement, I have a number of systems delivering batch feeds (CSV format, approximately once every 4 hours), with latency expected within 30 minutes to process and present for analysis. File sizes vary here, but most are in the 1Mb to 1Gb range. Should I use the same Snowpipe solution, or am I better off orchestrating the work from Airflow and using a COPY command followed by SQL statements on a dedicated virtual warehouse? Or indeed, what alternative would you recommend?
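The batch alternative I have in mind is Airflow issuing something like the following against a dedicated warehouse after each 4-hourly drop (a sketch under assumed names; `batch_stage`, `csv_fmt`, and `batch_events` are hypothetical):

```sql
-- Hypothetical COPY-based batch load, run by an orchestrator such as Airflow.
CREATE FILE FORMAT csv_fmt TYPE = 'CSV' SKIP_HEADER = 1;

COPY INTO batch_events
FROM @batch_stage/latest_feed/
FILE_FORMAT = (FORMAT_NAME = 'csv_fmt')
ON_ERROR = 'ABORT_STATEMENT';
```

The warehouse would be sized for the 1Mb-1Gb files and set to auto-suspend between runs, so compute is only billed while the load and downstream SQL actually execute.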
I can see that loading 4K files via Snowpipe is expensive, and that larger files are probably cheaper. If I load files over 10Mb in size, will these start to become more expensive again? I.e. is the cost a "bell curve" or does it flatten out?
Background