
Programmatically reading contents of text file stored in HDFS using Java

How do I run this simple Java program to read bytes from a text file stored in the directory /words in HDFS? Do I need to create a jar file for the purpose?

import java.io.*;
import java.net.MalformedURLException;
import java.net.URL;
import org.apache.hadoop.*;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class filesystemhdfs 
{
    public static void main(String args[]) throws MalformedURLException, IOException
    {
        byte[] b=null;
        InputStream in=null;
        in=new URL("hdfs://localhost/words/file").openStream();
        in.read(b);
        System.out.println(""+b);
        for(int i=0;i<b.length;i++)
        {
            System.out.println("b[i]=%d"+b[i]);
            System.out.println(""+(char)b[i]);
        }
    }
}

You can use the HDFS API; this can be run locally:

Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://namenode:8020");
FileSystem fs = FileSystem.get(configuration);
Path filePath = new Path("hdfs://namenode:8020/PATH");

FSDataInputStream fsDataInputStream = fs.open(filePath);
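For example, a minimal self-contained sketch of this approach that also consumes the opened stream (the namenode address and /PATH are placeholders, and the class name HdfsApiRead is only illustrative):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsApiRead {
    public static void main(String[] args) throws IOException {
        // Point the client at the namenode; replace host/port with your cluster's values.
        Configuration configuration = new Configuration();
        configuration.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(configuration);

        // Open the file and print it line by line as UTF-8 text.
        Path filePath = new Path("hdfs://namenode:8020/PATH");
        try (FSDataInputStream in = fs.open(filePath);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}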

First, you need to tell the JVM about the HDFS scheme in URL objects. This is done via:

URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());

After compiling your Java class, you need to use the hadoop command:

hadoop filesystemhdfs

Hadoop comes with a convenient IOUtils class. It will simplify a lot of this for you.
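Putting those two pieces together, a minimal sketch of the question's program, rewritten to register the hdfs:// scheme and copy the file to standard output via Hadoop's IOUtils, might look like this (the path hdfs://localhost/words/file is taken from the question; the class name is only illustrative):

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class FileSystemHdfs {
    static {
        // Teach the JVM the hdfs:// URL scheme; this can only be set once per JVM.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws IOException {
        InputStream in = null;
        try {
            in = new URL("hdfs://localhost/words/file").openStream();
            // Copy the file contents to stdout using a 4 KB buffer; streams are closed below.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}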

You cannot read a file from HDFS the way you would with a regular filesystem that Java supports. You need to use the HDFS Java API for this.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public static void main(String[] a) {
    // Act as the "root" user on the remote cluster.
    UserGroupInformation ugi = UserGroupInformation.createRemoteUser("root");

    try {
        ugi.doAs(new PrivilegedExceptionAction<Void>() {

            public Void run() throws Exception {
                Configuration conf = new Configuration();
                // fs.default.name should match the corresponding value
                // in core-site.xml on your hadoop cluster
                conf.set("fs.default.name", "hdfs://hostname:9000");
                conf.set("hadoop.job.ugi", "root");

                readFile("words/file", conf);

                return null;
            }
        });

    } catch (Exception e) {
        e.printStackTrace();
    }
}

public static void readFile(String file, Configuration conf) throws IOException {
    FileSystem fileSystem = FileSystem.get(conf);

    Path path = new Path(file);
    if (!ifExists(path, conf)) {
        System.out.println("File " + file + " does not exist");
        return;
    }

    // Read the file line by line and print it to stdout.
    FSDataInputStream in = fileSystem.open(path);
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
    String line = null;
    while ((line = br.readLine()) != null) {
        System.out.println(line);
    }
    br.close();
    in.close();
    fileSystem.close();
}

public static boolean ifExists(Path source, Configuration conf) throws IOException {
    FileSystem hdfs = FileSystem.get(conf);
    boolean isExists = hdfs.exists(source);
    System.out.println(isExists);
    return isExists;
}

Here I am connecting from a remote machine, which is why I use UserGroupInformation and put the code in the run method of a PrivilegedExceptionAction. If you are on the local system you may not need it. HTH!

It's a bit late to reply, but it may help future readers. This iterates over your HDFS directory and reads the content of each file.

Only the Hadoop client and Java are used.

Configuration conf = new Configuration();
conf.addResource(new Path("/your/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/your/hadoop/conf/hdfs-site.xml"));
FileSystem fs = FileSystem.get(conf);
FileStatus[] status = fs.listStatus(new Path("hdfs://path/to/your/hdfs/directory"));
for (int i = 0; i < status.length; i++) {
    FSDataInputStream inputStream = fs.open(status[i].getPath());
    // IOUtils here is org.apache.commons.io.IOUtils
    String content = IOUtils.toString(inputStream, "UTF-8");
    // use content as needed, e.g. System.out.println(content);
    inputStream.close();
}
