
How to read from HDFS using spark-shell in Intel Hadoop?

I am not able to read from HDFS (Intel distribution of Hadoop, version 1.0.3) from spark-shell (Spark version 1.2.1). I built Spark with the command mvn -Dhadoop.version=1.0.3 clean package, started spark-shell, and tried to read an HDFS file using sc.textFile(), but I get the following exception:

WARN hdfs.DFSClient: Failed to connect to /10.xx.xx.xx:50010, add to deadNodes and continue
java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.xx.xx.xx:44264 remote=/10.xx.xx.xx:50010]
...
ERROR executor.Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.io.IOException: Could not obtain block: blk_8724894648624652503_7309 file=/research/Files/README.md
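For reference, the read that triggers this looks roughly like the following in spark-shell (the namenode host and port are placeholders, not the actual cluster values):

// Run inside spark-shell; sc is the SparkContext the shell provides.
// The namenode host/port below are placeholders.
val readme = sc.textFile("hdfs://<namenode-host>:<port>/research/Files/README.md")
readme.count()  // the action that kicks off the DFSClient read and hits the timeout above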

The same question was asked here: http://mail-archives.us.apache.org/mod_mbox/spark-user/201309.mbox/%3CF97ADEE4FBA8F6478453E148FE9E2E8D3CCA37A9@HASMSX106.ger.corp.intel.com%3E

This was the suggested solution:

"In addition to specifying HADOOP_VERSION=1.0.3 in the ./project/SparkBuild.scala file, you will need to specify the libraryDependencies and name "spark-core" resolvers. Otherwise, sbt will fetch version 1.0.3 of hadoop-core from apache instead of Intel. You can set up your own local or remote repository that you specify"

Can anybody please elaborate on how to specify that SBT should fetch hadoop-core from Intel (which is available in our internal repository)?

Try taking a look at this page of the documentation (an excerpt is included below).

Spark uses some SBT/Maven integration that I do not know a lot about, but it seems like the repositories are specified in the pom.xml at the root of the project.

If that does not work, you can explore where the sbt files specify resolvers; a sketch of what such an entry might look like is shown below.
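As a rough sketch only (the repository URL and artifact coordinates below are placeholders, and the exact place where resolvers are declared differs between Spark versions and between the sbt and Maven builds), the kind of entry to look for or add would be something like:

// Hypothetical resolver and dependency entries for an sbt build definition.
// The URL and the hadoop-core coordinates are placeholders for the internal Intel repository.
resolvers += "Intel Internal Repository" at "http://maven.example.internal/repository/"
libraryDependencies += "org.apache.hadoop" % "hadoop-core" % "1.0.3"

Note that if the Intel build is published under exactly the same coordinates as the Apache artifact, sbt may still resolve the Apache copy from Maven Central first; in that case the internal repository needs to take precedence in the resolver chain (for example via sbt's externalResolvers setting), or the Intel artifact needs a distinct version string.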


For the record, this is an excerpt from the linked documentation:

Linking Applications to the Hadoop Version

In addition to compiling Spark itself against the right version, you need to add a Maven dependency on that version of hadoop-client to any Spark applications you run, so they can also talk to the HDFS version on the cluster. If you are using CDH, you also need to add the Cloudera Maven repository. This looks as follows in SBT:

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "<version>"

// If using CDH, also add Cloudera repo
resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

Or in Maven:

<project>
  <dependencies>
    ...
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>[version]</version>
    </dependency>
  </dependencies>

  <!-- If using CDH, also add Cloudera repo -->
  <repositories>
    ...
    <repository>
      <id>Cloudera repository</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>
</project>
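For the Intel distribution in the question, the same pattern would presumably apply: point the build at the internal repository that hosts the Intel-built Hadoop artifacts instead of the Cloudera one. The URL and version below are placeholders, since they depend on how the internal repository publishes the artifacts.

// Sketch of the application-side sbt settings, following the documented pattern above.
// Repository URL and hadoop-client version are placeholders for the internal setup.
resolvers += "Intel Hadoop Repository" at "http://artifacts.example.internal/intel-hadoop/"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "1.0.3"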
