
Submit local Spark job to EMR

I'm following the Amazon doc on submitting Spark jobs to an EMR cluster: https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/
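For reference, the setup from that doc boils down to something like the following sketch (MASTER_PUBLIC_DNS, the jar, and the class name are placeholders of mine; it assumes Spark is installed locally and the Hadoop client config is copied down from the cluster):

# copy the cluster config from the EMR master node to the local machine
scp -r hadoop@MASTER_PUBLIC_DNS:/etc/hadoop/conf ~/emr-conf

# point the local client at that config and submit against the remote YARN
export HADOOP_CONF_DIR=~/emr-conf
spark-submit --master yarn --deploy-mode cluster --class com.example.MyJob my-job.jar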

After following the instructions, including the usual troubleshooting, it fails due to an unresolved address, with a message similar to:

ERROR spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: java.net.UnknownHostException: ip-172-32-1-231.us-east-2.compute.internal
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)

As I saw that the IP it was trying to resolve was the master node's, I used sed to change it to the public one in the configuration files (the ones obtained from the /etc/hadoop/conf directory on the master node), roughly as sketched below.
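The substitution was along these lines (a hypothetical sketch: the private hostname is the one from the error above, MASTER_PUBLIC_DNS is a placeholder, and ~/emr-conf is wherever the copied config lives):

# rewrite the master's private hostname to the public one in every copied config file
sed -i 's/ip-172-32-1-231.us-east-2.compute.internal/MASTER_PUBLIC_DNS/g' ~/emr-conf/*.xml

That got past the master's address, but then the error is connecting to the datanodes: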

INFO hdfs.DFSClient: Exception in createBlockOutputStream
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/172.32.1.41:50010]
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
    at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1606)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1404)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1357)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:587)
19/02/08 13:54:58 INFO hdfs.DFSClient: Abandoning BP-1960505320-172.32.1.231-1549632479324:blk_1073741907_1086

Finally, I tried the same solution as this question: Spark HDFS Exception in createBlockOutputStream while uploading resource file, which was to add the following to the hdfs-site.xml file (this makes the HDFS client connect to datanodes by their hostnames rather than by the private IPs the namenode reports):

<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>   

but the error persists as an unresolved address exception:

19/02/08 13:58:06 WARN hdfs.DFSClient: DataStreamer Exception
java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
    at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1606)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1404)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1357)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:587)

Could somebody help me set up Spark on my local machine to do spark-submit to a remote EMR cluster?

In addition to applying the answer from the linked question, you should also add the worker nodes' (public) IPs and (private) DNS names to your /etc/hosts file, as sketched below.
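For example, with the addresses from the traces above, the entries would look roughly like this (the public IPs are placeholders, and the datanode's private DNS name is inferred from its 172.32.1.41 address following the usual EC2 naming scheme):

# /etc/hosts on the local machine: map each node's private DNS name
# to an address that is reachable from outside the VPC
MASTER_PUBLIC_IP   ip-172-32-1-231.us-east-2.compute.internal
WORKER_PUBLIC_IP   ip-172-32-1-41.us-east-2.compute.internal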
