
Relevance of Hadoop & Streaming solutions when Spark exists

I am starting a big data initiative for my startup. In 2018, is there any reason to use Hadoop at all, given that Spark is touted to be much faster, mainly because it does not write intermediate data to disk the way Hadoop's MapReduce does?

I realize Spark has a higher need for RAM, but wouldn't that be a one-time CAPEX cost that pays for itself?

In general, unless there are legacy projects, why should one pick up Hadoop at all when Spark is available?

I would appreciate real-world comparisons of the two, gotchas, etc.

Alternatively, are there use cases that Hadoop can solve but Spark cannot?

---------- comment below for actual problem ----------

I would use YARN as the resource manager and HDFS as the file system for Spark. Also note that Spark intersects quite a bit with the Hadoop ecosystem.

The comparisons are:

  1. MapReduce vs Spark code (a minimal side-by-side sketch follows this list)
  2. SparkSQL vs Hive
  3. People mention Pig too, but not a lot of people want to learn its custom query language. And if I had to use Pig as a data scientist, why wouldn't I use, say, Apache NiFi with Hadoop?
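
For comparison 1, here is a minimal word count sketch in Spark's Java API; the HDFS paths are placeholders, not paths from the question:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class WordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("word-count");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile("hdfs:///data/input"); // placeholder path
                JavaPairRDD<String, Integer> counts = lines
                        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                        .mapToPair(word -> new Tuple2<>(word, 1))
                        .reduceByKey(Integer::sum);
                counts.saveAsTextFile("hdfs:///data/output"); // placeholder path
            }
        }
    }

The equivalent MapReduce job spreads the same logic across a Mapper class, a Reducer class, and a driver, with the map output written to disk and shuffled between the two stages.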

I am also not sure how Spark handles the following:

  1. If data does not fit in RAM, then what? Fall back to a disk-based paradigm (not talking about streaming use cases here), and so no better than MapReduce? How does Tez make MR2 better?
  2. Hadoop 3 has support for erasure coding to reduce data replication. What does Spark do?

Where I am unclear is the plethora of overlapping choices. For example, streaming alone has:

  1. Spark Streaming
  2. Apache Storm
  3. Apache Samza
  4. Kafka Streams
  5. Commercial CEP tools (Oracle CEP, TIBCO, etc.)

Many of them use a DAG similar to Spark's core engine, which makes it hard to pick one over another.

Use case:

  1. The app sends data to middleware until the end of the event. An event can end at a specified periodicity or because a business condition is met.
  2. Middleware must show the real-time sum of a value (simplifying here) sent by users from their app instances. It is accepted that the middleware shows a floor of the actual sum and that the real value can be higher. The plan is to use Kafka Streams here, with a consumer that adds all the inputs with minimal latency and posts the result to a cache, which apps poll to show the current additive value.
  3. Middleware logs all input.
  4. After the event ends, a big data job scans through the log data and database records to get an accurate count by comparing all DB values and log entries (an audit), and compares them to the value Kafka showed. The value calculated by this scheme is the final value. (A sketch of such an audit job follows this list.)
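
A minimal sketch of how the audit job in step 4 could look in Spark's Java API, assuming the middleware logs land as JSON on HDFS and the database is reachable over JDBC; the paths, table name, column name, and connection URL are all placeholder assumptions:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    import static org.apache.spark.sql.functions.sum;

    public class EventAudit {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("event-audit").getOrCreate();

            // Middleware log of all inputs (step 3); path and schema are assumptions.
            Dataset<Row> logs = spark.read().json("hdfs:///middleware/logs/event-123");

            // Database records pulled over JDBC; connection details are placeholders.
            Dataset<Row> db = spark.read()
                    .format("jdbc")
                    .option("url", "jdbc:postgresql://dbhost/app")
                    .option("dbtable", "event_values")
                    .load();

            // Sum the audited value in both sources (assumes an integral "value" column).
            long logTotal = logs.agg(sum("value")).first().getLong(0);
            long dbTotal = db.agg(sum("value")).first().getLong(0);

            // The value computed here is the final one; compare it to what Kafka showed live.
            System.out.printf("log total = %d, db total = %d%n", logTotal, dbTotal);
            spark.stop();
        }
    }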

Design choices:

  1. I like Kafka because it decouples the application from the middleware and offers low-latency, high-throughput messaging, and the Streams code is easy to write. Happy for someone to counter-argue with Spark Streaming, Apache Storm, or Apache Samza instead.
  2. The application itself is Java code on a Tomcat server with REST endpoints for iOS/Android clients. Not doing client caching because the additive value must be explicitly live.

You're confusing Hadoop with just MapReduce. Hadoop is an ecosystem of MapReduce, HDFS, and YARN.

First of all, Spark doesn't have a filesystem. That's primarily why Hadoop is nice, in my book. Sure, you can use S3 or many other cloud storages, or bare-metal data stores like Ceph or GlusterFS, but from what I've researched, HDFS is by far the fastest when processing data.

Maybe you're not familiar with the concept of rack locality that YARN offers. If you use Spark standalone mode with any file system not mounted under the Spark executors, then all your data requests will be pulled over a network connection, saturating the network and causing a bottleneck, regardless of memory. Compare that to Spark executors running on YARN NodeManagers: HDFS DataNodes are ideally also NodeManagers.

A similar problem: people say Hive is slow and SparkSQL is faster. Well, that's true if you run Hive on MapReduce instead of the Tez or Spark execution engines.
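
The two also interoperate: SparkSQL can query existing Hive tables through the Hive metastore. A minimal sketch in Java, assuming hive-site.xml is available on the classpath and that a web_logs table exists (both are assumptions):

    import org.apache.spark.sql.SparkSession;

    public class HiveQuery {
        public static void main(String[] args) {
            // Reuses the existing Hive metastore rather than replacing Hive outright.
            SparkSession spark = SparkSession.builder()
                    .appName("sparksql-on-hive")
                    .enableHiveSupport()
                    .getOrCreate();

            // The same HiveQL, executed by Spark instead of MapReduce.
            spark.sql("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page").show();

            spark.stop();
        }
    }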

Now, if you want streaming and real-time events rather than the batch world commonly associated with Hadoop, you might want to research the SMACK stack.

Update

If I had to use Pig as a data scientist, why wouldn't I use, say, Apache NiFi with Hadoop?

Pig is not comparable to NiFi.

You can use NiFi; nothing is stopping you. It would run closer to real time than Spark micro-batches, and it is a good tool to pair with Kafka.

plethora of overlapping choices

Yes, and you didn't even list them all... It's up to a big data architect in your company to come up with a solution. You'll find that vendor support from Confluent is mostly for Kafka; I haven't seen them talk about Samza much. Hortonworks will support Storm, NiFi, and Spark, but they aren't running the latest version of Kafka if you want fancy features like KSQL. StreamSets is a similar company offering a tool that competes with NiFi, staffed by employees with backgrounds in other batch/streaming Apache projects.

Storm and Samza are two ways to do the same thing, as far as I know. I think Flink is more programmer-friendly than Storm. I don't have experience with Samza, though I work closely with people who primarily use Kafka Streams instead. And Kafka Streams isn't DAG-based; it's just a high-level Kafka library, embeddable in any JVM application.

If data does not fit in RAM, then what?

By default, it spills to disk... Spark has parameters you can configure if you don't want disk to be touched; in that case, your jobs just die of OOM more quickly, obviously.
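
One place this shows up in the API is RDD persistence: the storage level decides whether cached partitions that don't fit in memory are recomputed from lineage or spilled to local disk (shuffle output goes to local disk regardless). A sketch, with a placeholder input path:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;

    public class SpillBehavior {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("spill-demo");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> big = sc.textFile("hdfs:///data/huge"); // placeholder path

                // persist() defaults to MEMORY_ONLY: partitions that don't fit in RAM
                // are dropped and recomputed when needed. MEMORY_AND_DISK instead
                // spills those partitions to local disk.
                big.persist(StorageLevel.MEMORY_AND_DISK());

                System.out.println(big.count());
            }
        }
    }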

How does Tez make MR2 better?

Tez isn't MR. It builds more optimized DAGs, like Spark does. Go read about it.

Hadoop 3 has support for erasure coding to reduce data replication. What does Spark do?

Spark has no filesystem; we already covered this. Erasure coding is primarily for data at rest, not data during processing. I actually don't know whether Spark supports Hadoop 3 yet.

Application itself is Java code on Tomcat server with REST end points for iOS/Android clients

Personally, I would use Kafka Streams here because 1) you are already using Java, and 2) it's a standalone thread in your code that lets you read from and publish to Kafka without Hadoop/YARN or Spark clusters. It's not clear from your listed client-server architecture what your question has to do with Hadoop, but feel free to string an additional line from a Kafka topic to a database/analytics engine of your choice. The Kafka Connect framework has many connectors for you to choose from.
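
A minimal sketch of what the running-total consumer from your use case could look like as an embedded Kafka Streams topology; the topic names and Long-valued records are assumptions:

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Produced;

    public class RunningTotal {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-running-total");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            StreamsBuilder builder = new StreamsBuilder();

            // Each record: key = event id, value = the number an app instance submitted.
            KStream<String, Long> inputs = builder.stream(
                    "event-values", Consumed.with(Serdes.String(), Serdes.Long()));

            // Running sum per event, materialized in a local state store.
            KTable<String, Long> totals = inputs.groupByKey().reduce(Long::sum);

            // Publish every update so the cache the apps poll picks up the latest floor value.
            totals.toStream().to("event-totals", Produced.with(Serdes.String(), Serdes.Long()));

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }

The event-totals topic (or the state store itself) can then feed the cache your apps poll, or a Kafka Connect sink can push it to a database.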

You could also use NiFi as your mobile REST API, having it listen for HTTP requests (e.g. with the ListenHTTP processor) and send requests to it, then route flows based on attributes in the data. Then manipulate and publish to Kafka as well as other systems.

Spark and Hadoop work pretty similarly when it comes to solving MapReduce-style problems.

Hadoop is quite relevant if you look at it from the HDFS point of view: HDFS is a well-known, widely used solution for big data storage. But your question is about MapReduce.

Spark is the best option if you are talking about good machines with genuinely good memory and network throughput. But we know those machines are expensive, and sometimes your best option is to use Hadoop to process your data. Spark is great and fast, but you can go crazy with memory issues if you don't have a good cluster and you try to fit too much data in memory; Hadoop can be the better choice in that case. This problem, though, becomes less relevant year after year.

So Hadoop is here to complement Spark. Hadoop is not only MapReduce; Hadoop is an ecosystem. Spark doesn't have a distributed file system, and it needs one to work well. Spark also doesn't have a resource manager, while Hadoop has one called YARN, and Spark in cluster mode needs a resource manager.

Conclusion

Hadoop is still relevant as an ecosystem, but as MapReduce alone, I can say it is no longer being used much.
