How can we use streaming in Spark from multiple sources? e.g. first take data from HDFS, then consume a stream from Kafka

The problem arises when I already have a system and want to implement Spark Streaming on top of it. I have 50 million rows of transactional data in MySQL, and I want to do reporting on that data. I thought of dumping the data into HDFS. New data also arrives in the database every day, and I am adding Kafka for the new data.

I want to know how I can combine data from multiple sources and do analytics in near real time (a 1-2 minute delay is fine), and save those results, because future data needs previous results.
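The "future data needs previous results" requirement is usually handled with incremental (stateful) aggregation: bootstrap a persisted state from the historical batch, then fold each new micro-batch into it. Below is a minimal plain-Python sketch of that idea; the `update_state` helper, the keys, and the in-memory dict are illustrative stand-ins (in Spark Structured Streaming this role is played by stateful operators or by writing aggregates to an external store):

```python
# Minimal sketch of incremental (stateful) aggregation across micro-batches.
# In a real pipeline the historical batch would come from HDFS/MySQL and the
# micro-batches from Kafka; here both are plain lists of (key, amount) rows.

def update_state(state, rows):
    """Fold a batch of (key, amount) rows into the running totals."""
    for key, amount in rows:
        state[key] = state.get(key, 0) + amount
    return state

# 1. Bootstrap state from the historical dump (the "HDFS" part).
historical = [("acct-1", 100), ("acct-2", 50), ("acct-1", 25)]
state = update_state({}, historical)

# 2. Apply each new micro-batch as it arrives (the "Kafka" part).
micro_batches = [
    [("acct-1", 10)],
    [("acct-2", 5), ("acct-3", 7)],
]
for batch in micro_batches:
    state = update_state(state, batch)

print(state)  # running totals reflect history plus all batches
```

The key point is that the state must live somewhere durable (a database, a key-value store, or Spark's checkpointed state), so that each batch only processes new rows instead of rescanning all 50 million.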

Joins are possible in Spark SQL, but what happens when you need to update data in MySQL? Then your HDFS data becomes stale very quickly (faster than a few minutes, for sure). Tip: Spark can read over JDBC rather than needing HDFS exports.
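In Spark the JDBC route is `spark.read.format("jdbc")` pointed at the MySQL URL. To show why it matters without needing a MySQL server, the sketch below uses SQLite from the Python standard library as a stand-in (table name and values are made up): a point-in-time export goes stale the moment the source table is updated, while a live query does not.

```python
import sqlite3

# Stand-in for MySQL: an in-memory SQLite DB with one transactional table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tx (id INTEGER PRIMARY KEY, amount INTEGER)")
db.execute("INSERT INTO tx (amount) VALUES (100), (200)")

# A point-in-time "HDFS export": a copy of the rows at dump time.
export = db.execute("SELECT id, amount FROM tx ORDER BY id").fetchall()

# The source table is then updated -- the export is now stale.
db.execute("UPDATE tx SET amount = 999 WHERE id = 1")

# A live (JDBC-style) read sees the update immediately.
live = db.execute("SELECT id, amount FROM tx ORDER BY id").fetchall()
print("export:", export)  # still shows the old value for id=1
print("live:  ", live)
```

With Spark the live read would look something like `spark.read.format("jdbc").option("url", "jdbc:mysql://host:3306/appdb").option("dbtable", "tx").load()`; the trade-off is that every query hits the production database.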

Without knowing more about your systems, I'd say keep the MySQL database running, as something else is probably actively using it. If you want to use Kafka, that gives you a continuous feed of data, but HDFS/MySQL do not. Combining remote batch lookups with streams will be slow (could be more than a few minutes).

However, if you use Debezium to get data from MySQL into Kafka, then your data is centralized in one location, and you can then ingest from Kafka into an indexable store such as Druid, Apache Pinot, ClickHouse, or perhaps ksqlDB.
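For reference, registering a Debezium MySQL connector with Kafka Connect is a single JSON POST; a minimal config might look like the following (hostnames, credentials, and table names are placeholders, and exact property names vary by Debezium version, e.g. `topic.prefix` replaced `database.server.name` in Debezium 2.x):

```json
{
  "name": "mysql-tx-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.example.internal",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "********",
    "database.server.id": "184054",
    "topic.prefix": "appdb",
    "table.include.list": "appdb.transactions",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.appdb"
  }
}
```

Each committed insert/update/delete on the included tables then appears as a change event on a Kafka topic, which downstream stores can consume continuously.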

Query from those, as they are purpose-built for that use case, and you don't need Spark. Pick one or more, as they each support different use cases and query patterns.
