
What are the differences between pre-built and user-provided Hadoop on the Spark download page?

These questions have been puzzling me for a long time:

There are five package types in the second selector when the first one is set to version 2.4.4, and I am confused about 3 of them: Pre-built for Apache Hadoop 2.7, Pre-built with user-provided Apache Hadoop, and Pre-built with Scala 2.12 and user-provided Apache Hadoop. Let me list my questions one by one.

  1. What is the difference between Pre-built for Apache Hadoop 2.7 and Pre-built with user-provided Apache Hadoop? Does this mean there are two different situations: I already have a Hadoop cluster, or I don't have a Hadoop cluster? If the former, should I choose Pre-built with user-provided Apache Hadoop, and if the latter, will that package install a Hadoop cluster for me?
  2. What is the difference between Pre-built with user-provided Apache Hadoop and Pre-built with Scala 2.12 and user-provided Apache Hadoop? As far as I know, Spark already comes with Scala when I run spark-shell following the tutorial, and that package does not seem to be Pre-built with Scala 2.12 and user-provided Apache Hadoop, but just Pre-built with user-provided Apache Hadoop. (Am I right?) I think so because the command line shows something using Scala:

    scala> val a = 1;

So why is there still another package that emphasizes it is pre-built with Scala 2.12?

No option will install Hadoop for you. In all cases, Hadoop must either already exist or be bundled in the Spark download, and if you want to run Spark against HDFS and YARN, you must first set up that environment yourself.

You can choose the user-provided Hadoop option if you already have a running cluster and want to add or upgrade Spark, or if you are running Spark Standalone, on Mesos, or on Kubernetes instead. In that case, Hadoop scripts are not included in the download, although Spark still relies on core Hadoop libraries internally to function.
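That internal reliance shows up even without any cluster: a plain local read in spark-shell still goes through Hadoop's FileSystem and input-format classes. A small illustration (README.md here stands in for any local file):

    scala> spark.read.textFile("README.md").count()   // local path, still resolved through Hadoop's FileSystem API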

Spark also does not install Scala (or Java) for you. It is simply compiled against Scala 2.12, so trying to run against any other Scala version will result in classpath issues.
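You can confirm this from inside spark-shell: the REPL runs on the Scala version bundled with the download, regardless of any Scala you may have installed locally. A minimal check (the exact versions printed depend on your build):

    scala> util.Properties.versionString   // Scala bundled with spark-shell, e.g. "version 2.12.10" for a 2.12 build
    scala> spark.version                   // the Spark version of this download, e.g. "2.4.4"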

Summary:

  • we will need to install Hadoop separately in all three cases (1., 2., and 3. below) if we want to support HDFS and YARN
  • if we don't want to install Hadoop at all, we can use Spark pre-built with Hadoop (1. or 2.) and run Spark in Standalone mode, as in the sketch after this list
  • if we want to use an arbitrary version of Hadoop with Spark, then 3. should be used together with a separate installation of Hadoop
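For example, Standalone (or local) mode needs no Hadoop installation at all. A minimal sketch, assuming a hypothetical standalone master at spark://master-host:7077:

    import org.apache.spark.sql.SparkSession

    // Connect to a standalone master (hypothetical host); no HDFS or YARN involved.
    val spark = SparkSession.builder()
      .appName("standalone-example")
      .master("spark://master-host:7077")   // or "local[*]" to run on a single machine
      .getOrCreate()

    println(spark.range(100).count())       // a job that touches no Hadoop filesystem
    spark.stop()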

For Spark 3.1.1, the following package types exist for download:

  1. Pre-built for Apache Hadoop 2.7

This version of Spark runs with Hadoop 2.7.

  2. Pre-built for Apache Hadoop 3.2 and later

This version of Spark runs with Hadoop 3.2 and later.

  3. Pre-built with user-provided Apache Hadoop

This version of Spark runs with any user-provided version of Hadoop.

From the name of the last package (spark-3.1.1-bin-without-hadoop.tgz), it appears that we will need Hadoop for this version (i.e., 3.) and not for the others (i.e., 1. and 2.). However, the naming is ambiguous: we will need Hadoop only if we want to support HDFS and YARN. In Standalone mode, Spark can run in a truly distributed setting (or with its daemons on a single machine) without Hadoop.

For 1. and 2., you can run Spark without a Hadoop installation because some of the core Hadoop libraries come bundled with the Spark pre-built binary, so spark-shell works without throwing any exceptions; for 3., Spark will not work unless a Hadoop installation is provided (as 3. ships without the Hadoop runtime).
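One way to see which Hadoop runtime a with-hadoop build (1. or 2.) bundles is to ask from spark-shell; the exact version string depends on the download you picked:

    scala> org.apache.hadoop.util.VersionInfo.getVersion   // e.g. "2.7.3" for a Hadoop 2.7 build
    scala> sc.hadoopConfiguration.get("fs.defaultFS")      // "file:///" unless an HDFS is configured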

For more information, refer to this excerpt from the docs:

There are two variants of Spark binary distributions you can download. One is pre-built with a certain version of Apache Hadoop; this Spark distribution contains built-in Hadoop runtime, so we call it with-hadoop Spark distribution. The other one is pre-built with user-provided Hadoop; since this Spark distribution doesn't contain a built-in Hadoop runtime, it's smaller, but users have to provide a Hadoop installation separately. We call this variant no-hadoop Spark distribution. For with-hadoop Spark distribution, since it contains a built-in Hadoop runtime already, by default, when a job is submitted to Hadoop Yarn cluster, to prevent jar conflict, it will not populate Yarn's classpath into Spark ...

Hope this clears a bit of the confusion!
