How can we use streaming in Spark from multiple sources? e.g. first take data from HDFS, then consume a stream from Kafka

The problem arises when I already have a system and want to implement Spark Streaming on top of it. I have 50 million rows of transactional data in MySQL, and I want to do reporting on that data. I thought of dumping the data into HDFS. New data also arrives in the database every day, and I am adding Kafka for the new data.

I want to know how I can combine data from multiple sources and do analytics in near real time (a 1-2 minute delay is fine), and save those results, because future data needs previous results.
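The "future data needs previous results" requirement is usually handled with incremental (stateful) aggregation: bootstrap a persisted state from the historical batch, then fold each new micro-batch into it. Below is a minimal plain-Python sketch of that idea; the `update_state` helper, the keys, and the in-memory dict are illustrative stand-ins (in Spark Structured Streaming this role is played by stateful operators or by writing aggregates to an external store):

```python
# Minimal sketch of incremental (stateful) aggregation across micro-batches.
# In a real pipeline the historical batch would come from HDFS/MySQL and the
# micro-batches from Kafka; here both are plain lists of (key, amount) rows.

def update_state(state, rows):
    """Fold a batch of (key, amount) rows into the running totals."""
    for key, amount in rows:
        state[key] = state.get(key, 0) + amount
    return state

# 1. Bootstrap state from the historical dump (the "HDFS" part).
historical = [("acct-1", 100), ("acct-2", 50), ("acct-1", 25)]
state = update_state({}, historical)

# 2. Apply each new micro-batch as it arrives (the "Kafka" part).
micro_batches = [
    [("acct-1", 10)],
    [("acct-2", 5), ("acct-3", 7)],
]
for batch in micro_batches:
    state = update_state(state, batch)

print(state)  # running totals reflect history plus all batches
```

The key point is that the state must live somewhere durable (a database, a key-value store, or Spark's checkpointed state), so that each batch only processes new rows instead of rescanning all 50 million.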

Joins are possible in Spark SQL, but what happens when you need to update data in MySQL? Then your HDFS data becomes stale very quickly (faster than a few minutes, for sure). Tip: Spark can read over JDBC rather than needing HDFS exports.
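In Spark the JDBC route is `spark.read.format("jdbc")` pointed at the MySQL URL. To show why it matters without needing a MySQL server, the sketch below uses SQLite from the Python standard library as a stand-in (table name and values are made up): a point-in-time export goes stale the moment the source table is updated, while a live query does not.

```python
import sqlite3

# Stand-in for MySQL: an in-memory SQLite DB with one transactional table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tx (id INTEGER PRIMARY KEY, amount INTEGER)")
db.execute("INSERT INTO tx (amount) VALUES (100), (200)")

# A point-in-time "HDFS export": a copy of the rows at dump time.
export = db.execute("SELECT id, amount FROM tx ORDER BY id").fetchall()

# The source table is then updated -- the export is now stale.
db.execute("UPDATE tx SET amount = 999 WHERE id = 1")

# A live (JDBC-style) read sees the update immediately.
live = db.execute("SELECT id, amount FROM tx ORDER BY id").fetchall()
print("export:", export)  # still shows the old value for id=1
print("live:  ", live)
```

With Spark the live read would look something like `spark.read.format("jdbc").option("url", "jdbc:mysql://host:3306/appdb").option("dbtable", "tx").load()`; the trade-off is that every query hits the production database.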

Without knowing more about your systems, I'd say keep the MySQL database running, as something else is probably actively using it. If you want to use Kafka, that gives you a continuous feed of data, but HDFS/MySQL do not. Combining remote batch lookups with streams will be slow (could be more than a few minutes).

However, if you use Debezium to get data from MySQL into Kafka, then your data is centralized in one location, and you can then ingest from Kafka into an indexable store such as Druid, Apache Pinot, ClickHouse, or perhaps ksqlDB.
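For reference, registering a Debezium MySQL connector with Kafka Connect is a single JSON POST; a minimal config might look like the following (hostnames, credentials, and table names are placeholders, and exact property names vary by Debezium version, e.g. `topic.prefix` replaced `database.server.name` in Debezium 2.x):

```json
{
  "name": "mysql-tx-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.example.internal",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "********",
    "database.server.id": "184054",
    "topic.prefix": "appdb",
    "table.include.list": "appdb.transactions",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.appdb"
  }
}
```

Each committed insert/update/delete on the included tables then appears as a change event on a Kafka topic, which downstream stores can consume continuously.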

Query from those, as they are purpose-built for that use case, and you don't need Spark. Pick one or more, as they each support different use cases and query patterns.
