
How to read data from HDFS using Spark?

I am trying to read data from my HDFS at the location shown in the code below, but I'm not getting the data because a ConnectException is thrown.

I'm attaching the log output as well. What is the correct port number for Hadoop? Should I be using 50070?

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI


object random {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "D:\\Softwares\\Hadoop")
    val conf = new SparkConf().setMaster("local").setAppName("Hello")
    val sc = new SparkContext(conf)

    // Open the file directly through the Hadoop FileSystem API
    val hdfs = FileSystem.get(new URI("hdfs://104.211.213.47:50070/"), new Configuration())
    val path = new Path("/user/m1047068/retail/logerrors.txt")
    val stream = hdfs.open(path)

    // Lazily read lines; readLine returns null at end of file
    def readLines = Stream.continually(stream.readLine())

    // Check each line for null and print every existing line consecutively
    readLines.takeWhile(_ != null).foreach(println)
  }
}



--------------------------------------------------------------------------------

This is the log output I'm getting. I don't understand the exception, as I'm new to Spark.

2018-09-17 14:50:51 INFO  SparkContext:54 - Running Spark version 2.3.0
2018-09-17 14:50:51 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-09-17 14:50:51 INFO  SparkContext:54 - Submitted application: Hello
2018-09-17 14:50:51 INFO  SecurityManager:54 - Changing view acls to: M1047068
2018-09-17 14:50:51 INFO  SecurityManager:54 - Changing modify acls to: M1047068
2018-09-17 14:50:51 INFO  SecurityManager:54 - Changing view acls groups to: 
2018-09-17 14:50:51 INFO  SecurityManager:54 - Changing modify acls groups to: 
2018-09-17 14:50:51 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(M1047068); groups with view permissions: Set(); users  with modify permissions: Set(M1047068); groups with modify permissions: Set()
2018-09-17 14:50:52 INFO  Utils:54 - Successfully started service 'sparkDriver' on port 51772.
2018-09-17 14:50:52 INFO  SparkEnv:54 - Registering MapOutputTracker
2018-09-17 14:50:52 INFO  SparkEnv:54 - Registering BlockManagerMaster
2018-09-17 14:50:52 INFO  BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2018-09-17 14:50:52 INFO  BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2018-09-17 14:50:52 INFO  DiskBlockManager:54 - Created local directory at C:\Users\M1047068\AppData\Local\Temp\blockmgr-682d85a7-831e-4178-84de-5ade348a45f4
2018-09-17 14:50:52 INFO  MemoryStore:54 - MemoryStore started with capacity 896.4 MB
2018-09-17 14:50:52 INFO  SparkEnv:54 - Registering OutputCommitCoordinator
2018-09-17 14:50:53 INFO  log:192 - Logging initialized @3046ms
2018-09-17 14:50:53 INFO  Server:346 - jetty-9.3.z-SNAPSHOT
2018-09-17 14:50:53 INFO  Server:414 - Started @3188ms
2018-09-17 14:50:53 INFO  AbstractConnector:278 - Started ServerConnector@493dc226{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-09-17 14:50:53 INFO  Utils:54 - Successfully started service 'SparkUI' on port 4040.
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@16ce702d{/jobs,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@40238dd0{/jobs/json,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7776ab{/jobs/job,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@dbd8e44{/jobs/job/json,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@51acdf2e{/stages,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6a55299e{/stages/json,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2f1de2d6{/stages/stage,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3a0baae5{/stages/stage/json,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7ac0e420{/stages/pool,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@289710d9{/stages/pool/json,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5a18cd76{/storage,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3da30852{/storage/json,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@403f0a22{/storage/rdd,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@503ecb24{/storage/rdd/json,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4c51cf28{/environment,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6995bf68{/environment/json,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5143c662{/executors,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@77825085{/executors/json,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3568f9d2{/executors/threadDump,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@71c27ee8{/executors/threadDump/json,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3e7dd664{/static,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4748a0f9{/,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4b14918a{/api,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@77d67cf3{/jobs/job/kill,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6dee4f1b{/stages/stage/kill,null,AVAILABLE,@Spark}
2018-09-17 14:50:53 INFO  SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://G1C2ML15621.mindtree.com:4040
2018-09-17 14:50:53 INFO  Executor:54 - Starting executor ID driver on host localhost
2018-09-17 14:50:53 INFO  Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 51781.
2018-09-17 14:50:53 INFO  NettyBlockTransferService:54 - Server created on G1C2ML15621.mindtree.com:51781
2018-09-17 14:50:53 INFO  BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2018-09-17 14:50:53 INFO  BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, G1C2ML15621.mindtree.com, 51781, None)
2018-09-17 14:50:53 INFO  BlockManagerMasterEndpoint:54 - Registering block manager G1C2ML15621.mindtree.com:51781 with 896.4 MB RAM, BlockManagerId(driver, G1C2ML15621.mindtree.com, 51781, None)
2018-09-17 14:50:53 INFO  BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, G1C2ML15621.mindtree.com, 51781, None)
2018-09-17 14:50:53 INFO  BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, G1C2ML15621.mindtree.com, 51781, None)
2018-09-17 14:50:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6cbcf243{/metrics/json,null,AVAILABLE,@Spark}
Exception in thread "main" java.net.ConnectException: Call From G1C2ML15621/172.17.124.224 to 104.211.213.47:50070 failed on connection exception: java.net.ConnectException: Connection refused: no further information; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
    at java.lang.reflect.Constructor.newInstance(Unknown Source)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
    at org.apache.hadoop.ipc.Client.call(Client.java:1479)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:255)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1226)
    at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213)
    at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1201)
    at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:306)
    at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:272)
    at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:264)
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1526)
    at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
    at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
    at random$.main(random.scala:20)
    at random.main(random.scala)
Caused by: java.net.ConnectException: Connection refused: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
    at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
    at org.apache.hadoop.ipc.Client.call(Client.java:1451)
    ... 25 more
2018-09-17 14:51:00 INFO  SparkContext:54 - Invoking stop() from shutdown hook
2018-09-17 14:51:00 INFO  AbstractConnector:318 - Stopped Spark@493dc226{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-09-17 14:51:00 INFO  SparkUI:54 - Stopped Spark web UI at http://G1C2ML15621.mindtree.com:4040
2018-09-17 14:51:00 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-09-17 14:51:00 INFO  MemoryStore:54 - MemoryStore cleared
2018-09-17 14:51:00 INFO  BlockManager:54 - BlockManager stopped
2018-09-17 14:51:00 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2018-09-17 14:51:00 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-09-17 14:51:00 INFO  SparkContext:54 - Successfully stopped SparkContext
2018-09-17 14:51:00 INFO  ShutdownHookManager:54 - Shutdown hook called
2018-09-17 14:51:00 INFO  ShutdownHookManager:54 - Deleting directory C:\Users\M1047068\AppData\Local\Temp\spark-84d5b3c8-a609-42da-8e5e-5492400f309d

Spark can't read from WebHDFS that way. Port 50070 is the NameNode's HTTP (WebHDFS) port, while the hdfs:// scheme talks to the NameNode's RPC port.

You need to use the port number listed in the fs.defaultFS property of your core-site.xml (typically 8020 or 9000).

You also don't need to set the hadoop.home.dir property if you copy your Hadoop XML files into the conf folder of the Spark installation and define the HADOOP_CONF_DIR environment variable.
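For example, here is a minimal sketch (assuming core-site.xml is picked up via HADOOP_CONF_DIR) that verifies which NameNode address the client will use and connects to it, instead of hard-coding a host and port:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// With HADOOP_CONF_DIR on the classpath, new Configuration() loads
// core-site.xml automatically, so fs.defaultFS already holds the
// NameNode's RPC address (e.g. hdfs://104.211.213.47:8020 -- the 8020
// port here is an assumption; check your own core-site.xml).
val conf = new Configuration()
println(conf.get("fs.defaultFS"))  // confirm the host:port the client resolves
val hdfs = FileSystem.get(conf)    // connects via RPC, not the 50070 HTTP port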

And as of Spark 2, you want to be using SparkSession; from a session, you would use the textFile method to read a file.

You should never need to create a raw FileSystem object yourself in Spark.
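As a minimal sketch of that approach (the object name is made up, and the 8020 RPC port is an assumption -- take the real value from fs.defaultFS):

import org.apache.spark.sql.SparkSession

object ReadFromHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Hello")
      .getOrCreate()

    // textFile goes through the same Hadoop client, so the URI must use
    // the NameNode's RPC port (assumed 8020 here), not the 50070 HTTP port.
    val lines = spark.sparkContext.textFile("hdfs://104.211.213.47:8020/user/m1047068/retail/logerrors.txt")
    lines.take(10).foreach(println)

    spark.stop()
  }
}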
