
How to read a file from HDFS for the distributed cache on Hadoop

I am trying to load a file from HDFS into the distributed cache in Hadoop, but it does not work. I am using Hadoop version 2.5.1. This is how I am using the cached file in the mapper:

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        URI[] uris = context.getCacheFiles();

        for (URI uri : uris) {
            File usersFile = new File(uri);
            BufferedReader reader = new BufferedReader(new FileReader(usersFile));
            String line = reader.readLine();

            ...

            reader.close();
        }
    }

Below are the three different ways I tried to load the cache in my driver:

1) If I put the file in the cache like this, it works, but it loads the file from my local FS (I am running the code on a Mac):

    job.addCacheFile(new URI("file:///input/users.txt"));

2) If I use hdfs as the scheme as follows (the file exists on HDFS under "/input/"):

    job.addCacheFile(new URI("hdfs:///input/users.txt"));

I get this exception:

java.lang.Exception: java.lang.IllegalArgumentException: URI scheme is not "file"
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.IllegalArgumentException: URI scheme is not "file"
    at java.io.File.<init>(File.java:395)

3) This is the third way I tried to load the file:

    job.addCacheFile(new URI("hdfs://localhost:9000/input/users.txt"));

I get the following exception:

java.lang.Exception: java.lang.IllegalArgumentException: URI scheme is not "file"
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.IllegalArgumentException: URI scheme is not "file"
    at java.io.File.<init>(File.java:395)

I would appreciate it if someone could shed light on why these exceptions occur.

I was having the same issue. The exceptions occur because java.io.File only accepts URIs with the "file" scheme, so an hdfs:// URI cannot be turned into a File; a file that lives on HDFS has to be opened through the Hadoop FileSystem API instead. Pro Apache Hadoop, 2nd Edition describes such a method for files in HDFS. You can only access the sample code by downloading the source code for the book; it is in the Chapter 6 folder (MapSideJoinMRJob3.java). I will post the snippets that are different. Hope this helps:

    private FileSystem hdfs = null;

    // Reads every line of a file on the job's file system (HDFS here),
    // going through FileSystem.open() rather than java.io.File.
    public List<String> readLinesFromJobFS(Path p) throws Exception {
        List<String> ls = new ArrayList<String>();

        BufferedReader br = new BufferedReader(new InputStreamReader(
                this.hdfs.open(p)));
        try {
            String line = br.readLine();
            while (line != null) {
                ls.add(line);
                line = br.readLine();
            }
        } finally {
            br.close();
        }
        return ls;
    }
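
For example (this line is not from the book, just an illustration): once the hdfs field above is initialized, a cached URI from context.getCacheFiles() only needs to be wrapped in an org.apache.hadoop.fs.Path instead of a java.io.File:

    // uri is one of the entries returned by context.getCacheFiles();
    // new Path(uri) accepts hdfs:// URIs, unlike java.io.File(uri).
    List<String> lines = readLinesFromJobFS(new Path(uri));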

    @Override
    public void setup(Context context) {
        try {
            // Get a handle on the file system the job runs against (HDFS here)
            // so readLinesFromJobFS() can open the cached files.
            this.hdfs = FileSystem.get(context.getConfiguration());
            URI[] uris = context.getCacheFiles();
            for (URI uri : uris) {
                if (uri.toString().endsWith("airports.csv")) {
                    this.readAirports(uri);
                }
                if (uri.toString().endsWith("carriers.csv")) {
                    this.readCarriers(uri);
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
            throw new RuntimeException(ex);
        }
    }
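
Applied to the users.txt case from the question, the same idea looks roughly like the following. This is a minimal sketch, not code from the book: the class name UsersMapper and the tab-separated "userId<TAB>userName" record format are made up for illustration, and it assumes the driver registered the file with job.addCacheFile(new URI("hdfs://localhost:9000/input/users.txt")).

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class UsersMapper extends Mapper<LongWritable, Text, Text, Text> {

        // Loaded once per mapper from the distributed-cache file.
        private final Map<String, String> users = new HashMap<String, String>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            for (URI uri : context.getCacheFiles()) {
                if (!uri.toString().endsWith("users.txt")) {
                    continue;
                }
                // Resolve the file system from the URI itself, so an
                // hdfs:// URI is opened on HDFS even under LocalJobRunner.
                FileSystem fs = FileSystem.get(uri, context.getConfiguration());
                // new Path(uri) accepts an hdfs:// URI, unlike java.io.File.
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(new Path(uri))));
                try {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // Hypothetical record format: "userId<TAB>userName".
                        String[] parts = line.split("\t", 2);
                        if (parts.length == 2) {
                            users.put(parts[0], parts[1]);
                        }
                    }
                } finally {
                    reader.close();
                }
            }
        }
    }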
