
Can apache spark run without hadoop?

Are there any dependencies between Spark and Hadoop?

If not, are there any features I'll miss when I run Spark without Hadoop?

Spark is an in-memory distributed computing engine.

Hadoop is a framework for distributed storage (HDFS) and distributed processing (YARN).

Spark can run with or without Hadoop components (HDFS/YARN).


Distributed Storage:

Since Spark does not have its own distributed storage system, it has to depend on one of these storage systems for distributed computing (a minimal read/write sketch follows the options below).

S3 – Non-urgent batch jobs. S3 fits very specific use cases when data locality isn't critical.

Cassandra – Perfect for streaming data analysis but overkill for batch jobs.

HDFS – Great fit for batch jobs without compromising on data locality.
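A Spark job looks the same regardless of which of these storage systems backs it; only the path scheme changes. Here is a minimal Scala sketch (the namenode address, bucket name and local paths are made-up placeholders, and the S3 case additionally needs the hadoop-aws jars on the classpath):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("storage-demo").getOrCreate()

// HDFS: good data locality for batch jobs (namenode host/port is a placeholder)
val fromHdfs = spark.read.parquet("hdfs://namenode:8020/data/events")

// S3: fine when data locality isn't critical (bucket name is a placeholder)
val fromS3 = spark.read.parquet("s3a://my-bucket/data/events")

// Plain local file system: works on a single node with no Hadoop cluster at all
val fromLocal = spark.read.parquet("file:///tmp/events")
fromLocal.write.mode("overwrite").parquet("file:///tmp/events_copy")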


Distributed processing:

You can run Spark in three different modes: Standalone, YARN and Mesos.
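In code, the mode is just the master URL you configure (or pass via --master to spark-submit). A hedged Scala sketch with placeholder host names and ports:

import org.apache.spark.sql.SparkSession

// Local mode: no cluster manager and no Hadoop needed
val local = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()

// Standalone: Spark's own built-in cluster manager (host/port are placeholders)
// val standalone = SparkSession.builder().master("spark://master-host:7077").appName("demo").getOrCreate()

// YARN (needs HADOOP_CONF_DIR pointing at your Hadoop config) or Mesos
// val onYarn  = SparkSession.builder().master("yarn").appName("demo").getOrCreate()
// val onMesos = SparkSession.builder().master("mesos://mesos-host:5050").appName("demo").getOrCreate()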

Have a look at the below SE question for a detailed explanation of both distributed storage and distributed processing.

Which cluster type should I choose for Spark?

Spark can run without Hadoop but some of its functionality relies on Hadoop's code (e.g. handling of Parquet files). We're running Spark on Mesos and S3, which was a little tricky to set up but works really well once done (you can read a summary of what was needed to properly set it up here).

(Edit) Note: since version 2.3.0 Spark also added native support for Kubernetes.
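For the S3 side, the setup usually comes down to having the hadoop-aws jar (and a matching AWS SDK) on the classpath and setting the standard s3a properties; a hedged sketch, with placeholder credentials and bucket name:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-demo").getOrCreate()

// Standard s3a keys from hadoop-aws; values are placeholders. Prefer instance
// profiles or a credentials provider over hard-coded keys in real deployments.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoopConf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

val df = spark.read.parquet("s3a://some-bucket/some/path") // bucket is hypothetical
df.show(5)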

By default, Spark does not have a storage mechanism.

To store data, it needs a fast and scalable file system. You can use S3, HDFS or any other file system. Hadoop is an economical option due to its low cost.

Additionally, if you use Tachyon, it will boost performance with Hadoop. Hadoop is highly recommended for Apache Spark processing.

Yes, Spark can run without Hadoop. All core Spark features will continue to work, but you'll miss things like easily distributing all your files (code as well as data) to all the nodes in the cluster via HDFS, etc.
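Without HDFS you can still push individual files out to the executors with Spark's own file distribution; a small sketch (the file path is made up):

import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("addfile-demo").getOrCreate()

// Ship a local file to every executor (path is hypothetical)
spark.sparkContext.addFile("/tmp/lookup.csv")

// On an executor (or the driver), resolve the local copy by file name
val localPath = SparkFiles.get("lookup.csv")
println(localPath)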

As per the Spark documentation, Spark can run without Hadoop.

You may run it in Standalone mode without any resource manager.

But if you want to run in a multi-node setup, you need a resource manager like YARN or Mesos and a distributed file system like HDFS, S3, etc.

Yes, you can install Spark without Hadoop. That would be a little tricky. You can refer to Arnon's link on using Parquet configured on S3 as data storage: http://arnon.me/2015/08/spark-parquet-s3/

Spark only does processing, and it uses dynamic memory to perform the task, but to store the data you need some data storage system. This is where Hadoop comes into play with Spark: it provides the storage for Spark. One more reason for using Hadoop with Spark is that they are open source and both integrate with each other easily compared to other data storage systems. For other storage like S3, configuration is tricky, as mentioned in the link above.

But Hadoop also has its own processing unit, called MapReduce.

Want to know the difference between the two?

Check this article: https://www.dezyre.com/article/hadoop-mapreduce-vs-apache-spark-who-wins-the-battle/83

I think this article will help you understand

  • what to use,

  • when to use, and

  • how to use!

Yes, of course. Spark is an independent computation framework. Hadoop is a distributed storage system (HDFS) with the MapReduce computation framework. Spark can get data from HDFS, as well as from any other data source such as a traditional database (JDBC), Kafka or even local disk.
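For example, reading straight from a relational database over JDBC needs no Hadoop at all; a hedged sketch (connection URL, table and credentials are placeholders, and the JDBC driver jar must be on the classpath):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("jdbc-demo").getOrCreate()

val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/shop") // placeholder connection URL
  .option("dbtable", "public.orders")                   // placeholder table name
  .option("user", "spark")
  .option("password", "secret")
  .load()

orders.show(10)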

Yes, Spark can run with or without a Hadoop installation. For more details you can visit https://spark.apache.org/docs/latest/

Yes, Spark can run without Hadoop. You can install Spark on your local machine without Hadoop. But the Spark distribution comes with pre-built Hadoop libraries, i.e., they are used while installing on your local machine.

You can run Spark without Hadoop, but Spark has a dependency on Hadoop winutils (on Windows), so some features may not work. Also, if you want to read Hive tables from Spark, then you need Hadoop.
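Concretely, Hive access goes through enableHiveSupport(), which is what pulls the Hadoop/Hive client jars into the picture; a hedged sketch (the database and table names are made up):

import org.apache.spark.sql.SparkSession

// enableHiveSupport() needs the Hive and Hadoop client dependencies on the classpath
val spark = SparkSession.builder()
  .appName("hive-demo")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SELECT * FROM some_db.some_table LIMIT 10").show() // hypothetical table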

I'm not good at English, forgive me!

TL;DR

Use local (single node) or standalone (cluster) mode to run Spark without Hadoop, but it still needs Hadoop dependencies for logging and some file processing.
Windows is strongly NOT recommended for running Spark!


Local mode

There are many running modes for Spark; one of them, called local, runs without Hadoop dependencies.
So, here is the first question: how do we tell Spark we want to run in local mode?
After reading this official doc, I just gave it a try on my Linux OS:

  1. Install Java and Scala; not the core content, so I'll skip it.
  2. Download the Spark package
    There are two types of package: "without hadoop" and "hadoop integrated".
    The most important thing is that "without hadoop" does NOT mean it runs without Hadoop, just that it is not bundled with Hadoop, so you can bundle it with your custom Hadoop!
    Spark can run without Hadoop (HDFS and YARN) but needs Hadoop dependency jars such as the Parquet/Avro SerDe classes, so I strongly recommend using the "integrated" package (you will find some log dependencies like log4j and slf4j and other common utility classes missing if you choose the "without hadoop" package, but all of these are bundled with the hadoop-integrated package)!
  3. Run in local mode
    The simplest way is to just run the shell, and you will see the welcome log
# same as ./bin/spark-shell --master local[*]
./bin/spark-shell
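Once the shell is up in local mode, a quick way to confirm that nothing Hadoop-specific is required is to read a plain local file from inside the shell (the path is just an example):

// type this into the spark-shell: counts lines of a local file, no HDFS involved
val lines = spark.read.textFile("file:///etc/hosts")
println(lines.count())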

Standalone mode

Same as above, but step 3 is different.

# Start up the cluster
# if you want to run it in the foreground
# export SPARK_NO_DAEMONIZE=true 
./sbin/start-master.sh
# run this on every worker
./sbin/start-worker.sh spark://VMS110109:7077

# Submit a job or just open a shell
./bin/spark-shell --master spark://VMS110109:7077

On Windows?

I know many people run Spark on Windows just for study, but it is quite different on Windows, and I really strongly do NOT recommend using Windows.

The most important thing is to download winutils.exe from here and configure the system variable HADOOP_HOME to point to where winutils is located.

At this moment, 3.2.1 is the latest release version of Spark, but a bug exists. You will get an exception like Illegal character in path at index 32: spark://xxxxxx:63293/D:\classe when running ./bin/spark-shell.cmd; only starting up a standalone cluster and then using ./bin/spark-shell.cmd, or using a lower version, can temporarily fix this. For more detail and the solution you can refer here.

No. It requires a full-blown Hadoop installation to start working - https://issues.apache.org/jira/browse/SPARK-10944

