
Connect from a Windows machine to Spark

I'm very (very!) new to Spark and Scala. I've been trying to implement what I thought would be the easy task of connecting to a Linux machine that has Spark on it and running some simple code.

When I write a simple Scala program, build a jar from it, copy it to the machine, and run spark-submit, everything works and I get a result (like the "SimpleApp" example here: http://spark.apache.org/docs/latest/quick-start.html ).
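
For context, the code I'm building into the jar is along the lines of the quick-start SimpleApp; the file path below is just a placeholder:

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Placeholder path: any text file on the Linux machine will do for this test
    val logFile = "/opt/spark/README.md"
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile).cache()
    val numAs = logData.filter(_.contains("a")).count()
    val numBs = logData.filter(_.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    sc.stop()
  }
}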

My question is: are all of these steps mandatory? Must I compile, build, and copy the jar to the machine, and then manually run it every time I change it?

Assume the jar is already on the machine. Is there a way to run it (calling spark-submit) directly from different code through my IDE?

Taking it a bit further, let's say I want to run different tasks. Do I have to create different jars and place all of them on the machine? Are there any other approaches?

Any help will be appreciated! Thanks!

There are two ways of running your code: submitting the job to a cluster, or running in local mode, which requires no Spark cluster to be set up. Local mode is generally used for building and testing an application on small data sets; the application is then built and submitted as a job for production.

Running in Local Mode

val conf = new SparkConf().setMaster("local").setAppName("wordCount Example")

Setting the master to "local" runs Spark locally, in the same JVM as your application.
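
A minimal local-mode sketch built around that conf; the input here is just an in-memory collection, so nothing has to exist on disk or on a cluster:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountLocal {
  def main(args: Array[String]): Unit = {
    // "local" (or "local[*]" to use all cores) runs Spark inside this JVM
    val conf = new SparkConf().setMaster("local").setAppName("wordCount Example")
    val sc = new SparkContext(conf)

    val lines = sc.parallelize(Seq("spark is fast", "spark is simple"))
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)

    sc.stop()
  }
}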

If you have already built your jar, you can reuse it by specifying the Spark master's URL and adding the required jars; this submits the job to a remote cluster.

val conf = new SparkConf()
      .setMaster("spark://cyborg:7077")
      .setAppName("SubmitJobToCluster Example")
      .setJars(Seq("target/spark-example-1.0-SNAPSHOT-driver.jar"))

Using the SparkConf you can initialize a SparkContext in your application and use it in either a local or a cluster setup.

 val sc = new SparkContext(conf)
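
Putting the two pieces together, a driver run directly from the IDE might look like the sketch below; the master URL and jar path are the example values from above, not something your cluster necessarily uses:

import org.apache.spark.{SparkConf, SparkContext}

object SubmitJobToCluster {
  def main(args: Array[String]): Unit = {
    // Placeholder master URL and jar path: replace with your cluster's values.
    // setJars ships the already-built jar to the executors, so nothing has to be
    // copied to the cluster by hand before each run.
    val conf = new SparkConf()
      .setMaster("spark://cyborg:7077")
      .setAppName("SubmitJobToCluster Example")
      .setJars(Seq("target/spark-example-1.0-SNAPSHOT-driver.jar"))
    val sc = new SparkContext(conf)

    val data = sc.parallelize(1 to 1000)
    println(s"Sum computed on the cluster: ${data.sum()}")

    sc.stop()
  }
}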

There is an old project, spark-examples, with sample programs that you can run directly from your IDE.

So, answering your questions:

  • Are all of these steps mandatory? Must I compile, build, and copy the jar to the machine, and then manually run it every time I change it? No.
  • Assume the jar is already on the machine. Is there a way to run it (calling spark-submit) directly from different code through my IDE? Yes, you can; the example above does it.
  • Taking it a bit further, let's say I want to run different tasks. Do I have to create different jars and place all of them on the machine? Are there any other approaches? You only need one jar containing all your tasks and dependencies; you can specify the main class when submitting the job to Spark (see the sketch after this list). When doing it programmatically you have complete control over it.
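
As an illustration of the "one jar, many tasks" approach, here is a sketch with two entry points packaged in the same project (the package, object, and jar names are made up); whichever class you pass to spark-submit via --class is the task that runs:

// Pick an entry point at submit time, e.g.:
//   spark-submit --class jobs.LineCount --master spark://cyborg:7077 \
//     target/spark-example-1.0-SNAPSHOT-driver.jar /data/input.txt
package jobs

import org.apache.spark.{SparkConf, SparkContext}

object LineCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LineCount"))
    println(s"Lines: ${sc.textFile(args(0)).count()}")
    sc.stop()
  }
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    sc.textFile(args(0)).flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
      .collect().foreach(println)
    sc.stop()
  }
}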
