
How to read a file on HDFS for distributed cache on Hadoop

I am trying to load a file into the distributed cache in Hadoop from HDFS, but it does not work. I am using Hadoop version 2.5.1. This is the code showing how I use the cached file in the mapper:

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    URI[] uris = context.getCacheFiles();

    for (URI uri : uris) {
        File usersFile = new File(uri);
        BufferedReader reader = new BufferedReader(new FileReader(usersFile));
        String line = reader.readLine();

        ...

        reader.close();
    }
}

Below are the three different ways I tried to load the cache in my driver: 1) If I put the file in the cache like this, it works, but it loads the file from my local FS (I am running the code on a Mac):

 job.addCacheFile(new URI("file:///input/users.txt"));

2) If I use hdfs as the scheme as follows (the file exists on HDFS under "/input/"):

job.addCacheFile(new URI("hdfs:///input/users.txt"));

I get this exception:

java.lang.Exception: java.lang.IllegalArgumentException: URI scheme is not "file"
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.IllegalArgumentException: URI scheme is not "file"
    at java.io.File.<init>(File.java:395)

3) This is the third way I tried to load the file:

job.addCacheFile(new URI("hdfs://localhost:9000/input/users.txt"));

I get the following exception:

java.lang.Exception: java.lang.IllegalArgumentException: URI scheme is not "file"
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.IllegalArgumentException: URI scheme is not "file"
    at java.io.File.<init>(File.java:395)

I would appreciate it if someone could shed light on why these exceptions occur.

I was having the same issue. The root cause is that java.io.File can only represent local file: URIs, so new File(uri) throws IllegalArgumentException for an hdfs: URI; files in HDFS have to be read through the Hadoop FileSystem API instead. There is a reference to a different method for files that live in HDFS in Pro Apache Hadoop, 2nd Edition. You can only access the sample code by downloading the source code for the book; it is in the chapter 6 folder (MapSideJoinMRJob3.java). I will post the snippets that are different. Hope this helps:

    private FileSystem hdfs = null;

    // Reads a file from the job's FileSystem (HDFS here) line by line,
    // skipping the first line (the CSV header) as the original code does.
    public List<String> readLinesFromJobFS(Path p) throws Exception {
        List<String> ls = new ArrayList<String>();

        BufferedReader br = new BufferedReader(new InputStreamReader(
                this.hdfs.open(p)));
        try {
            br.readLine(); // discard the header line
            String line;
            while ((line = br.readLine()) != null) {
                ls.add(line);
            }
        } finally {
            br.close(); // the original never closed the reader
        }
        return ls;
    }

    public void setup(Context context) {
        try {
            // Obtain a handle to the FileSystem the job runs against (HDFS),
            // then dispatch each cached file to the matching reader.
            this.hdfs = FileSystem.get(context.getConfiguration());
            URI[] uris = context.getCacheFiles();
            for (URI uri : uris) {
                if (uri.toString().endsWith("airports.csv")) {
                    this.readAirports(uri);
                }
                if (uri.toString().endsWith("carriers.csv")) {
                    this.readCarriers(uri);
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
            throw new RuntimeException(ex);
        }
    }
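
For completeness, here is a minimal sketch of how the same approach applies to the users.txt file from the question. The hdfs://localhost:9000 authority and the /input/users.txt path are taken from the question, not from the book's code; the key point is that an hdfs: URI is fine in addCacheFile as long as the mapper reads it back through the Hadoop FileSystem API rather than java.io.File:

    // Driver side: register the HDFS file in the distributed cache.
    job.addCacheFile(new URI("hdfs://localhost:9000/input/users.txt"));

    // Mapper side: open the cached file through FileSystem, not java.io.File.
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        for (URI uri : context.getCacheFiles()) {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path(uri))));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    // process each line of users.txt here
                }
            } finally {
                reader.close();
            }
        }
    }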
