Spark distribute local file from master to nodes
I used to run Spark locally, and distributing files to the nodes never caused me problems, but now that I am moving things to Amazon's cluster service, things are starting to break down. Basically, I am processing some IP addresses using the Maxmind GeoLiteCity.dat, which I placed on the local file system of the master (file:///home/hadoop/GeoLiteCity.dat).
Following a question from earlier, I used sc.addFile:
sc.addFile("file:///home/hadoop/GeoLiteCity.dat")
and call on it using something like:
val ipLookups = IpLookups(geoFile = Some(SparkFiles.get("GeoLiteCity.dat")), memCache = false, lruCache = 20000)
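On a cluster, the usual pattern is that SparkFiles.get must be resolved on the executors, inside the closure, because addFile stages a copy of the file into each executor's local working directory; resolving the path on the driver and shipping it does not work. A minimal sketch of that pattern, assuming the IpLookups API comes from the scala-maxmind-iplookups library and a hypothetical ipRdd of IP-address strings:

```scala
import org.apache.spark.SparkFiles
import com.snowplowanalytics.maxmind.iplookups.IpLookups

// Stage the database file so every executor gets a local copy.
sc.addFile("file:///home/hadoop/GeoLiteCity.dat")

// ipRdd: RDD[String] of IP addresses (assumed for illustration).
val results = ipRdd.mapPartitions { ips =>
  // Resolve the staged copy on THIS executor, one lookup object per partition.
  val ipLookups = IpLookups(
    geoFile = Some(SparkFiles.get("GeoLiteCity.dat")),
    memCache = false,
    lruCache = 20000)
  ips.map(ip => ipLookups.performLookups(ip))
}
```

Building the lookup object once per partition (rather than per record) avoids re-opening the database file for every IP.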
This works when running locally on my computer, but it seems to fail on the cluster. I do not know the reason for the failure, and I would appreciate it if someone could tell me how to display the logs for the process; the logs generated by the Amazon service do not contain any information about which step is failing.
Do I have to somehow load GeoLiteCity.dat onto HDFS? Are there other ways to distribute a local file from the master across to the nodes without HDFS?
EDIT: To specify the way I run things: I wrote a JSON file that defines multiple steps. The first step runs a bash script which transfers GeoLiteCity.dat from Amazon S3 to the master:
#!/bin/bash
cd /home/hadoop
aws s3 cp s3://test/GeoLiteCity.dat GeoLiteCity.dat
After checking that the file is in the directory, the JSON then executes the Spark JAR, but it fails. The logs produced by the Amazon web UI do not show where the code breaks.
Instead of copying the file onto the master, load the file into S3 and read it from there.
Refer to http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html for reading files from S3.
You need to provide an AWS Access Key ID and Secret Access Key. Either set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or set them programmatically, like:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY)
Then you can just read the file as a text file, like:
sc.textFile("s3n://test/GeoLiteCity.dat")
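Putting the answer together, a hedged end-to-end sketch (the bucket name comes from the question; reading the credentials from environment variables is an assumption, made to avoid hard-coding secrets in the driver program):

```scala
// Sketch using Hadoop's s3n:// connector; sc is the usual SparkContext.
// Credentials are pulled from the environment rather than hard-coded.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

val lines = sc.textFile("s3n://test/GeoLiteCity.dat")
```

One caveat: GeoLiteCity.dat is a binary database, and sc.textFile is really suited to line-oriented text, so for the IpLookups use case it may be preferable to keep the bash download step from the question and distribute the local file with sc.addFile instead.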
Additional reference: How to read input from S3 in a Spark Streaming EC2 cluster application, https://stackoverflow.com/a/30852341/4057655