
Spark-streaming for task parallelization

I am designing a system with the following flow:

  1. Download feed files (line-based) over the network
  2. Parse the elements into objects
  3. Filter invalid / unnecessary objects
  4. Execute blocking IO (HTTP requests) on a subset of the elements
  5. Save to DB

[Flow diagram]

I have been considering implementing the system using Spark Streaming, mainly for task parallelization, resource management, fault tolerance, etc.

But I am not sure this is the right use case for Spark Streaming, as I am not using it only for metrics and data processing. Also, I'm not sure how Spark Streaming handles blocking IO tasks.

Is Spark Streaming suitable for this use case? Or should I look for another technology/framework?

Spark is, at its heart, a general parallel computing framework. Spark Streaming adds an abstraction on top of it to support stream processing using micro-batching. We can certainly implement such a use case on Spark Streaming.
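As a minimal sketch of what micro-batching means in practice (the app name and 10-second batch interval are illustrative assumptions, not values from the question):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Every 10 seconds, the data received so far becomes a small RDD that is
// processed with the usual Spark machinery (partitions, tasks, retries).
val conf = new SparkConf().setAppName("feed-pipeline")
val ssc = new StreamingContext(conf, Seconds(10))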

To 'fan out' the I/O operations, we need to ensure the right degree of parallelism at two levels:

  • First, distribute the data evenly across partitions: the initial partitioning of the data will depend on the streaming source used. For this use case, it looks like a custom receiver could be the way to go (a minimal sketch follows this list). After each batch is received, we probably need to use dstream.repartition(n) to redistribute the data over a larger number of partitions, where n should roughly match 2-3x the number of executors allocated for the job.

  • Second, Spark uses 1 core (configurable) for each task executed, and tasks are executed per partition. This assumes that each task is CPU intensive and needs a full CPU. To optimize execution for blocking I/O, we would instead like to multiplex that core across many operations. We do this by operating directly on the partitions and using classical concurrent programming to parallelize our work.
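To make the custom-receiver idea from the first bullet concrete, here is a minimal sketch using Spark Streaming's Receiver API. The polling loop, the URL parameter, and the one-minute interval are assumptions for illustration, not part of the original question:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver: polls a feed URL and stores each line into Spark.
class FeedReceiver(feedUrl: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("feed-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          scala.io.Source.fromURL(feedUrl).getLines().foreach(store)
          Thread.sleep(60000) // poll interval: an illustrative assumption
        }
      }
    }.start()
  }

  def onStop(): Unit = () // the polling thread exits once isStopped() is true
}

// Wiring it up: val feedLinesDstream = ssc.receiverStream(new FeedReceiver(url))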

Given the original stream of feedLinesDstream, we could do something like this (*in Scala; a Java version should be similar, but like x times more LOC):

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

val feedLinesDstream = ??? // the original dstream of feed lines
val parsedElements = feedLinesDstream.map(parseLine)
val validElements = parsedElements.filter(isValid _)
val distributedElements = validElements.repartition(n) // n = 2 to 3 x # of executors

// multiplex execution at the level of each partition
val data = distributedElements.mapPartitions { iter =>
  implicit val executionContext: ExecutionContext = ??? // obtain a thread pool for execution
  // materialize the partition so that all I/O futures are started eagerly
  val futures = iter.map(elem => Future(ioOperation(elem))).toList
  // traverse the futures, resulting in a single future collection of results
  val res = Future.sequence(futures)
  Await.result(res, timeout).iterator // mapPartitions must return an Iterator
}
data.saveToCassandra(keyspace, table) // requires the Spark Cassandra connector
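For the ??? placeholder inside mapPartitions, one plausible way to obtain the thread pool (my own assumption, not part of the original answer) is a fixed-size executor, oversized relative to the core count because the tasks spend most of their time blocked on HTTP I/O:

import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// Hypothetical helper: holding the pool in an object means it is created
// lazily once per executor JVM instead of being serialized with the closure.
object IoPools {
  lazy val httpPool: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(32))
}

Inside mapPartitions you would then write implicit val executionContext = IoPools.httpPool; the pool size (32 here) is a tuning knob that should reflect the latency of the HTTP calls.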

Is Spark Streaming suitable for this use case? Or should I look for another technology/framework?

When considering using Spark, you should ask yourself a few questions:

  1. What is the scale of my application in its current state, and where will it grow to in the future? (Spark is generally meant for Big Data applications where millions of operations happen per second.)

  2. Which language do I prefer? (Spark applications can be written in Java, Scala, Python, and R.)

  3. What database will I be using? (Technologies like Apache Spark are normally deployed alongside large DB structures like HBase.)

Also, I'm not sure how Spark Streaming handles blocking IO tasks.

There is already an answer on Stack Overflow about blocking IO tasks using Spark in Scala. It should give you a start, but to answer that question: yes, it is possible.

Lastly, reading the documentation is important, and you can find Spark's right here.
