Read a text file from HDFS line by line in mapper

Is the following mapper code for reading a text file from HDFS correct? And if it is:

  1. What happens if two mappers on different nodes try to open the file at almost the same time?
  2. Doesn't the InputStreamReader need to be closed? If so, how can that be done without closing the filesystem?

My code is:

Path pt=new Path("hdfs://pathTofile");
FileSystem fs = FileSystem.get(context.getConfiguration());
BufferedReader br=new BufferedReader(new InputStreamReader(fs.open(pt)));
String line;
line=br.readLine();
while (line != null){
System.out.println(line);

This will work with some amendments (I assume the code you pasted is just truncated):

Path pt=new Path("hdfs://pathTofile");
FileSystem fs = FileSystem.get(context.getConfiguration());
BufferedReader br=new BufferedReader(new InputStreamReader(fs.open(pt)));
try {
  String line;
  line=br.readLine();
  while (line != null){
    System.out.println(line);

    // be sure to read the next line otherwise you'll get an infinite loop
    line = br.readLine();
  }
} finally {
  // close the BufferedReader: this also closes the InputStreamReader and the
  // HDFS stream it wraps, but leaves the FileSystem itself open
  br.close();
}
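The same loop can also be written with try-with-resources, which closes the reader automatically. The following is only a sketch: `LineReading` and `readLines` are hypothetical names, not part of Hadoop, and the UTF-8 charset is an assumption about the file. It takes a plain `InputStream`, so in a mapper you would pass it the stream returned by `fs.open(pt)`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Hadoop class): reads all lines from a stream.
final class LineReading {
    private LineReading() {}

    static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader, and with it the wrapped stream,
        // even if readLine() throws. Nothing else is closed.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) { // charset is an assumption
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

In the mapper this would be called as `LineReading.readLines(fs.open(pt))`. Only the stream is closed; the `FileSystem` object stays usable for later calls.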

You can have more than one mapper reading the same file, but there is a limit beyond which it makes more sense to use the Distributed Cache. Doing so not only reduces the load on the data nodes that host the file's blocks, but is also more efficient when a job has more tasks than there are task nodes.
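With the Distributed Cache, the file is copied to each task node's local disk, so the mapper can read it as an ordinary local file once per task instead of opening it on HDFS repeatedly. The sketch below shows only that "load once, reuse" part in plain Java. `SideFileLookup` is a hypothetical name, and treating the file as a set of lookup keys is an assumption made for illustration. In a real job the file would typically be registered with `job.addCacheFile(uri)` and located through `context.getCacheFiles()` in the mapper's `setup` method.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path; // java.nio Path for a local file, not org.apache.hadoop.fs.Path
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: keeps the lines of a localized side file in memory.
final class SideFileLookup {
    private final Set<String> entries;

    private SideFileLookup(Set<String> entries) {
        this.entries = entries;
    }

    // Meant to be called once per task (for example from setup), not once per record.
    static SideFileLookup load(Path localFile) throws IOException {
        // readAllLines opens and closes the file itself
        return new SideFileLookup(
                new HashSet<>(Files.readAllLines(localFile, StandardCharsets.UTF_8)));
    }

    boolean contains(String key) {
        return entries.contains(key);
    }

    int size() {
        return entries.size();
    }
}
```

This keeps the whole file in memory, so it only suits side files that are small relative to the task's heap.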

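// Scala version of the same approach; returns the lines instead of only printing them
import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ListBuffer

def load_file(path: String): List[String] = {
  val pt = new Path(path)
  val fs = FileSystem.get(new Configuration())
  val br = new BufferedReader(new InputStreamReader(fs.open(pt)))
  // ListBuffer gives constant-time appends; appending to a List copies it each time
  val res = ListBuffer[String]()
  try {
    var line = br.readLine()
    while (line != null) {
      println(line)
      res += line
      // read the next line, otherwise the loop never ends
      line = br.readLine()
    }
  } finally {
    // close the BufferedReader (and the HDFS stream it wraps)
    br.close()
  }
  res.toList
}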
