
Using Snowpipe - What's the best practice for loading small files, e.g. thousands of 4K files per day?

Questions

  1. How much more expensive is it to load small files (e.g. 4K) using Snowpipe than, say, 16K, 500K or 1-10Mb files (the recommended size)? Note: this question assumes it is more expensive to load small files than the recommended 1-10Mb.

  2. I understand best practice is to load files sized 1-10Mb, but I need near real-time delivery (a few minutes). I could concatenate files to make them larger, but I can't wait more than 60 seconds before sending the micro-batch to S3 and therefore to Snowpipe. I currently write whatever I have every 30 seconds, but I see Snowpipe reporting every 60 seconds. Does this mean there is no point writing files to S3 more frequently than every 60 seconds? i.e. if I send a file every 30 seconds, will that actually reduce average latency, or is 60 seconds the minimum Snowpipe cycle?

  3. Loading 4K files (around 200Mb a day at 4K per file) is costing around 20 credits per gigabyte, which is very expensive (see the rough per-file cost estimate after this list). What kind of cost per gigabyte should I expect from Snowpipe if I load, for example, CSV files in the 1-10Mb range? Will my cost per gigabyte drop if I stay within the 1-10Mb range?

  4. Is there any faster/cheaper alternative for getting data into Snowflake? Note: I'm currently using Snowpipe to load Parquet into a VARIANT column, then using STREAMS and TASKS to restructure the data for near real-time analysis. I understand it is cheaper to use Snowpipe than a virtual warehouse. Is this true? I suspect the real answer is "it depends", but depends upon what?

  5. In addition to my near real-time requirement, I have a number of systems delivering batch feeds (CSV format, approximately once every 4 hours), with latency expected to be within 30 minutes to process and present for analysis. File sizes vary here, but most are in the 1Mb to 1Gb range. Should I use the same Snowpipe solution, or am I better off orchestrating the work from Airflow and using a COPY command followed by SQL statements on a dedicated virtual warehouse? Or indeed, what alternative would you recommend?

  6. I can see that loading 4K files through Snowpipe is expensive, and that larger files are probably cheaper. If I load files over 10Mb in size, will they start to become more expensive again? i.e. is the cost a "bell curve", or does it flatten out?
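
A rough back-of-the-envelope check on question 3, assuming Snowpipe's documented per-file charge of roughly 0.06 credits per 1,000 files queued (worth verifying against current Snowflake pricing) and ignoring the serverless compute portion:

    50,000 files/day x 0.06 credits / 1,000 files  =  3 credits/day in file charges alone
    3 credits / 0.2 GB/day (i.e. 200Mb per day)    =  ~15 credits per gigabyte

If that per-file figure is right, most of the ~20 credits per gigabyte observed would come from the per-file overhead rather than the data volume; the same 200Mb delivered as 1-10Mb files would be roughly 20-200 files a day, making that overhead negligible.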

Background

  1. I'm using Snowpipe to deliver a near real-time (NRT) data load solution.
  2. I have data being replicated from Kafka to S3 roughly every 30 seconds from approximately 30 tables, and it's being loaded automatically into Snowflake using Snowpipe.
  3. Data is passed to me in Parquet format and loaded into a VARIANT column; a view then extracts the attributes into a table before Tasks and SQL restructure the data for analysis (a minimal sketch of this pipeline follows this list).
  4. In a single day I found 50,000 files loaded; file size varies, but the average is around 4K per file.
  5. I can see around 30 files per minute being loaded (i.e. around 100K of data per minute).
  6. I'm trying to balance several non-functional requirements: a) efficient use of credits - I'm aware small files are expensive; b) reduced latency - I'm aiming for a pipeline of around 2-5 minutes maximum from Kafka to dashboard; c) simplicity - it needs to be easy to understand and maintain, as I expect the solution to grow massively, from around 20 tables to many hundreds of tables, all needing near real-time delivery.
  7. In the next 3 months I will also have a number of CSV batch loads every 4 hours. They are entirely independent data sources (separate from the NRT feed) and involve much more intensive processing and ELT. I'm wondering whether I should use Snowpipe or COPY for these.
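
For reference, a minimal sketch of the kind of pipeline described in item 3 above, assuming hypothetical object names (raw_events, events_stage, transform_wh, events_final) and illustrative attribute paths; the real definitions will differ. Auto-ingest also requires the S3 bucket's event notifications to be wired to the pipe's notification channel.

    -- Landing table: a single VARIANT column for the raw Parquet records (names are hypothetical)
    CREATE TABLE raw_events (v VARIANT);

    -- External stage over the S3 prefix that Kafka replicates into
    CREATE STAGE events_stage
      URL = 's3://my-bucket/events/'
      FILE_FORMAT = (TYPE = PARQUET);

    -- Snowpipe with auto-ingest: loads each new file as its S3 event notification arrives
    CREATE PIPE events_pipe AUTO_INGEST = TRUE AS
      COPY INTO raw_events
      FROM @events_stage
      FILE_FORMAT = (TYPE = PARQUET);

    -- View exposing the attributes as typed columns (paths are illustrative)
    CREATE VIEW events_vw AS
      SELECT v:id::NUMBER              AS id,
             v:event_ts::TIMESTAMP_NTZ AS event_ts,
             v:payload::STRING         AS payload
      FROM raw_events;

    -- Stream to capture newly loaded rows, and a task that restructures them each minute
    CREATE STREAM raw_events_stream ON TABLE raw_events;

    CREATE TASK restructure_events
      WAREHOUSE = transform_wh
      SCHEDULE  = '1 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('RAW_EVENTS_STREAM')
    AS
      INSERT INTO events_final (id, event_ts, payload)  -- target table assumed to exist
      SELECT v:id::NUMBER, v:event_ts::TIMESTAMP_NTZ, v:payload::STRING
      FROM raw_events_stream;

    ALTER TASK restructure_events RESUME;               -- tasks are created suspended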

Answer

  1. Snowpipe is serverless and billed on usage. The serverless approach has much less overhead than spinning up a warehouse, but some overhead remains, so the more often you send data, the more it will cost. How much? Try it out; nobody can tell you that.
  2. I'm no expert here, but Snowflake is not built for real-time workloads, whatever marketing may tell you. In the worst case you should expect a couple of minutes before your data is fully refreshed. Snowflake is good at handling huge data loads, where you can afford to wait a bit longer.
  3. Again, try it out. One indicator is how much your data ingestion actually keeps a warehouse busy: if the warehouse runs for 1 minute but the load itself finishes in 1 second, serverless billing could give you a 60-fold cost reduction.
  4. For your use case, Snowpipe should be the cheapest option, assuming you are not fully occupying a warehouse.
  5. COPY INTO should be fine (a minimal sketch follows this list).
  6. I don't know :) Try it out. I guess it doesn't make a big difference, though you might run into problems with large files (1Gb+).
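
For the 4-hourly CSV batches in question 5, a minimal sketch of the COPY-based alternative that Airflow could orchestrate on a dedicated warehouse; the warehouse, stage, table and copy options are hypothetical and would need tuning for the actual feeds.

    -- Run on a warehouse sized for the batch window (names are hypothetical)
    USE WAREHOUSE batch_wh;

    COPY INTO sales_staging
      FROM @batch_stage/sales/
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"')
      PATTERN = '.*[.]csv'            -- restrict to the current batch's files in practice
      ON_ERROR = 'ABORT_STATEMENT';

    -- The heavier transformation SQL then runs on the same warehouse,
    -- after which Airflow can suspend it to stop the billing.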
