
Spark distribute local file from master to nodes

I used to run Spark locally, and distributing files to the nodes never caused me problems, but now that I am moving things to the Amazon cluster service, things start to break down. Basically, I am processing some IPs using the Maxmind GeoLiteCity.dat, which I placed on the local file system on the master (file:///home/hadoop/GeoLiteCity.dat).

Following a question from earlier, I used sc.addFile:

sc.addFile("file:///home/hadoop/GeoLiteCity.dat")

and call on it using something like:

val ipLookups = IpLookups(geoFile = Some(SparkFiles.get("GeoLiteCity.dat")), memCache = false, lruCache = 20000)

This works when running locally on my computer, but it seems to be failing on the cluster (I do not know the reason for the failure, but I would appreciate it if someone could tell me how to display the logs for the process; the logs generated by the Amazon service do not contain any information on which step is failing).

Do I have to somehow load the GeoLiteCity.dat onto HDFS? Are there other ways to distribute a local file from the master across to the nodes without HDFS?
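For reference, here is the sc.addFile / SparkFiles.get pattern described above written out as a self-contained sketch, independent of the IpLookups library (the object name and the file-size check are illustrative only, not part of the original job):

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object AddFileSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("addfile-sketch"))

    // Driver side: register the file so Spark ships a copy to every executor.
    sc.addFile("file:///home/hadoop/GeoLiteCity.dat")

    // Executor side: SparkFiles.get resolves the node-local path of that copy.
    val sizes = sc.parallelize(1 to sc.defaultParallelism)
      .map { _ =>
        val localPath = SparkFiles.get("GeoLiteCity.dat")
        new java.io.File(localPath).length() // just confirms the copy exists
      }
      .collect()

    println(sizes.mkString(", "))
    sc.stop()
  }
}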

EDIT: Just to specify the way I run this: I wrote a JSON file which performs multiple steps; the first step runs a bash script that transfers the GeoLiteCity.dat from Amazon S3 to the master:

#!/bin/bash
cd /home/hadoop
aws s3 cp s3://test/GeoLiteCity.dat GeoLiteCity.dat

After checking that the file is in the directory, the JSON then executes the Spark JAR, but it fails. The logs produced by the Amazon web UI do not show where the code breaks.

Instead of copying the file onto the master, load the file into S3 and read it from there.

Refer to http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html for reading files from S3.

You need to provide an AWS Access Key ID and Secret Key. Either set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or set them programmatically like:

sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY)
sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY)

Then you can just read the file as a text file, like:

sc.textFile("s3n://test/GeoLiteCity.dat")

Additional reference: How to read input from S3 in a Spark Streaming EC2 cluster application (https://stackoverflow.com/a/30852341/4057655)
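Putting the answer's two pieces together, a minimal self-contained sketch might look like the following (the object name, the appName, and reading the credentials from the environment variables mentioned above are assumptions for illustration; the s3://test/GeoLiteCity.dat location comes from the question):

import org.apache.spark.{SparkConf, SparkContext}

object S3ReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("geoip-from-s3"))

    // Credentials for the s3n:// filesystem, read here from the environment
    // variables mentioned in the answer (fails fast if they are not set).
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Read the object directly from S3 instead of the master's local disk.
    val lines = sc.textFile("s3n://test/GeoLiteCity.dat")
    println(s"line count: ${lines.count()}")

    sc.stop()
  }
}

Note that sc.textFile treats the object as lines of text, as the answer suggests; since GeoLiteCity.dat is a binary database, sc.binaryFiles would be an alternative, but the sketch follows the answer as written.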
