
Reading remote HDFS file with Java

I'm having a bit of trouble with a simple Hadoop install. I've downloaded hadoop 2.4.0 and installed it on a single CentOS Linux node (Virtual Machine). I've configured hadoop for a single node with pseudo-distribution as described on the apache site ( http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-common/SingleCluster.html ). It starts with no issues in the logs and I can read + write files using the "hadoop fs" commands from the command line.

I'm attempting to read a file from the HDFS on a remote machine with the Java API. The machine can connect and list directory contents. It can also determine if a file exists with the code:

Path p=new Path("hdfs://test.server:9000/usr/test/test_file.txt");
FileSystem fs = FileSystem.get(new Configuration());
System.out.println(p.getName() + " exists: " + fs.exists(p));

The system prints "true" indicating it exists. However, when I attempt to read the file with:

BufferedReader br = null;
try {
    Path p=new Path("hdfs://test.server:9000/usr/test/test_file.txt");
    FileSystem fs = FileSystem.get(CONFIG);
    System.out.println(p.getName() + " exists: " + fs.exists(p));

    br=new BufferedReader(new InputStreamReader(fs.open(p)));
    String line = br.readLine();

    while (line != null) {
        System.out.println(line);
        line=br.readLine();
    }
}
finally {
    if(br != null) br.close();
}

this code throws the exception:

Exception in thread "main" org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-13917963-127.0.0.1-1398476189167:blk_1073741831_1007 file=/usr/test/test_file.txt

Googling gave some possible tips but they all checked out. The data node is connected, active, and has enough space. The admin report from hdfs dfsadmin -report shows:

Configured Capacity: 52844687360 (49.22 GB)
Present Capacity: 48507940864 (45.18 GB)
DFS Remaining: 48507887616 (45.18 GB)
DFS Used: 53248 (52 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

Datanodes available: 1 (1 total, 0 dead)

Live datanodes:
Name: 127.0.0.1:50010 (test.server)
Hostname: test.server
Decommission Status : Normal
Configured Capacity: 52844687360 (49.22 GB)
DFS Used: 53248 (52 KB)
Non DFS Used: 4336746496 (4.04 GB)
DFS Remaining: 48507887616 (45.18 GB)
DFS Used%: 0.00%
DFS Remaining%: 91.79%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Last contact: Fri Apr 25 22:16:56 PDT 2014

The client jars were copied directly from the hadoop install so no version mismatch there. I can browse the file system with my Java class and read file attributes. I just can't read the file contents without getting the exception. If I try to write a file with the code:

FileSystem fs = null;
BufferedWriter br = null;

System.setProperty("HADOOP_USER_NAME", "root");

try {
    fs = FileSystem.get(new Configuration());

    //Path p = new Path(dir, file);
    Path p = new Path("hdfs://test.server:9000/usr/test/test.txt");
    br = new BufferedWriter(new OutputStreamWriter(fs.create(p,true)));
    br.write("Hello World");
}
finally {
    if(br != null) br.close();
    if(fs != null) fs.close();
}

this creates the file but doesn't write any bytes and throws the exception:

Exception in thread "main" org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /usr/test/test.txt could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.

Googling for this indicated a possible space issue but from the dfsadmin report, it seems there is plenty of space. This is a plain vanilla install and I can't get past this issue.

The environment summary is:

SERVER:

Hadoop 2.4.0 with pseudo-distribution ( http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-common/SingleCluster.html )

CentOS 6.5 Virtual Machine, 64-bit server, Java 1.7.0_55

CLIENT:

Windows 8 (Virtual Machine), Java 1.7.0_51

Any help is greatly appreciated.

Hadoop error messages are frustrating. Often they don't say what they mean and have nothing to do with the real issue. I've seen problems like this occur when the client, namenode, and datanode cannot communicate properly. In your case I would pick one of two issues:

  • Your cluster runs in a VM and its virtualized network access to the client is blocked.
  • You are not consistently using fully-qualified domain names (FQDN) that resolve identically between the client and host.

The host name "test.server" is very suspicious. Check all of the following:

  • Is test.server a FQDN?
  • Is this the name that has been used EVERYWHERE in your conf files?
  • Can the client and all hosts forward and reverse resolve "test.server" and its IP address and get the same thing?
  • Are IP addresses being used instead of FQDN anywhere?
  • Is "localhost" being used anywhere?

Any inconsistency in the use of FQDN, hostname, numeric IP, and localhost must be removed. Do not ever mix them in your conf files or in your client code. Consistent use of FQDN is preferred. Consistent use of numeric IP usually also works. Use of unqualified hostname, localhost, or 127.0.0.1 causes problems.
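
As a quick client-side sanity check (this sketch is mine, not part of the original answer), something like the following can show whether forward and reverse lookups of the NameNode host agree as seen from the client; "test.server" is just the host name from the question.

import java.net.InetAddress;

public class DnsCheck {
    public static void main(String[] args) throws Exception {
        String host = "test.server"; // NameNode host name from the question

        // Forward lookup: host name -> IP address, as this client resolves it
        InetAddress addr = InetAddress.getByName(host);
        System.out.println(host + " resolves to " + addr.getHostAddress());

        // Reverse lookup: IP address -> canonical host name
        String reverse = addr.getCanonicalHostName();
        System.out.println(addr.getHostAddress() + " reverse-resolves to " + reverse);

        // The two names should match, and the address should not be a loopback
        // address; either mismatch is the kind of inconsistency described above.
        if (!reverse.equalsIgnoreCase(host) || addr.isLoopbackAddress()) {
            System.out.println("WARNING: name resolution is inconsistent for " + host);
        }
    }
}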

The answer above is pointing in the right direction. Allow me to add the following:

  1. The Namenode does NOT directly read or write data.
  2. The client (your Java program using direct access to HDFS) interacts with the Namenode to update the HDFS namespace and retrieve block locations for reading/writing.
  3. The client interacts directly with a Datanode to read/write data.

You were able to list directory contents because hostname:9000 was accessible to your client code; you were doing number 2 above.
To be able to read and write, your client code needs access to the Datanode (number 3). The default port for Datanode DFS data transfer is 50010. Something was blocking your client communication to hostname:50010, possibly a firewall or SSH tunneling configuration problem.
I was using Hadoop 2.7.2, so maybe you have a different port number setting.
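
To illustrate (this sketch is mine, not part of the original answer), a minimal TCP check like the one below shows whether the client can reach the DataNode transfer port at all; "test.server" comes from the question and 50010 is the Hadoop 2.x default, so adjust both for your cluster.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class DataNodePortCheck {
    public static void main(String[] args) {
        String host = "test.server"; // DataNode host from the question
        int port = 50010;            // default DataNode data-transfer port in Hadoop 2.x

        // Open a plain TCP connection with a short timeout; no HDFS client needed.
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 5000);
            System.out.println("Reached " + host + ":" + port + " - data transfer port is open");
        } catch (IOException e) {
            // A timeout or "connection refused" here points at a firewall,
            // VM/NAT networking, or a DataNode bound to the wrong address.
            System.out.println("Could not reach " + host + ":" + port + " : " + e);
        }
    }
}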

We need to make sure the configuration has fs.default.name set, such as:

configuration.set("fs.default.name","hdfs://ourHDFSNameNode:50000");

Below I've put a piece of sample code:

 Configuration configuration = new Configuration();
 // fs.default.name is deprecated in Hadoop 2.x; fs.defaultFS is the current key
 configuration.set("fs.default.name", "hdfs://ourHDFSNameNode:50000");

 Path pt = new Path("/path/to/file.txt");   // the HDFS path to read (example value)
 FileSystem fs = pt.getFileSystem(configuration);

 BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)));
 try {
     String line = br.readLine();
     while (line != null) {
         System.out.println(line);
         line = br.readLine();
     }
 } finally {
     br.close();
 }
