How to enrich events using a very large database with Azure Stream Analytics?
I'm in the process of evaluating Azure Stream Analytics to replace a stream-processing solution based on NiFi with some REST microservices. One step is the enrichment of sensor data from a very large database of sensors (>120 GB).
Is this possible with Azure Stream Analytics? I tried with a very small subset of the data (60 MB) and couldn't even get it to run.
Job logs give me warnings that memory usage is too high. I tried scaling to 36 streaming units to see if it was even possible, to no avail.
What strategies could make this work?
If I deterministically partition the input stream into N partitions by ID (via a hash function) and then partition the database using the same hash function (so that a given ID on the stream and the same ID in the database land in the same partition), can I make this work? Do I need to create several separate Stream Analytics jobs to be able to do that?
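The deterministic partitioning idea above can be sketched in a few lines. This is an illustration only; the function name, the MD5 choice, and the partition count are assumptions, not anything ASA-specific. The key point is that the hash must be stable across processes, which rules out Python's built-in `hash()` (it is randomized per interpreter run):

```python
import hashlib

N_PARTITIONS = 8  # hypothetical partition count


def partition_for(sensor_id: str, n: int = N_PARTITIONS) -> int:
    """Map a sensor ID to a stable partition index in [0, n).

    A fixed digest (MD5 here) is used so that applying the same
    function to the stream's partition key and to the reference
    rows sends matching IDs to the same partition.
    """
    digest = hashlib.md5(sensor_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n
```

Each of the N jobs would then consume partition i of the stream and join it against reference chunk i, produced by running the same function over the database's IDs.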
I suppose I can use 5 GB chunks, but I could not get it to work with an ADLS Gen2 data lake. Does it really only work with Azure SQL?
Stream Analytics supports reference datasets of up to 5 GB. Please note that large reference datasets come with the downside of making job/node restarts very slow (up to 20 minutes for the reference data to be distributed; restarts may be user-initiated, triggered by service updates, or caused by various errors).
If you can downsize that 120 GB to 5 GB (scoping to only the columns and rows you need, and converting to types that are smaller in size), then you should be able to run that workload. Sadly, we don't support partitioned reference data yet. This means that as of now, if you have to use ASA and can't reduce those 120 GB, you will have to deploy one distinct job for each subset of stream/reference data.
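The column-scoping step mentioned above can be sketched as a simple CSV rewrite. The column names here are hypothetical; in practice you would keep only the columns the enrichment join actually matches on or projects:

```python
import csv

# Hypothetical columns to retain from the wide reference export.
KEEP = ["sensor_id", "location", "calibration_offset"]


def downsize(src_path: str, dst_path: str) -> None:
    """Rewrite a wide reference CSV keeping only the needed columns,
    shrinking the file toward the 5 GB reference-data limit.

    Streams row by row, so it works on files larger than memory.
    """
    with open(src_path, newline="") as fin, \
         open(dst_path, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=KEEP)
        writer.writeheader()
        for row in reader:
            writer.writerow({col: row[col] for col in KEEP})
```

Dropping unused rows (e.g. decommissioned sensors) and narrowing value representations (short codes instead of long strings) would be handled the same way, inside the per-row loop.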
Now, I'm surprised you couldn't get 60 MB of reference data to run; if you have details on what exactly went wrong, I'm happy to provide guidance.