
Reading large images from HDFS in MapReduce

There is a very large image (~200MB) in HDFS (block size 64MB). I want to know the following:

  1. How do I read the image in a MapReduce job?

  2. Many posts suggest WholeFileInputFormat. Are there any other alternatives, and how would I go about it?

  3. When WholeFileInputFormat is used, will the blocks be processed in parallel? I guess not.

If your block size is 64 MB, HDFS has most probably split your image file into chunks (a 200 MB file occupies four blocks: three full 64 MB blocks plus an 8 MB tail) and replicated them across the cluster, depending on your cluster configuration.

Assuming that you want to process your image file as one record rather than multiple blocks / line by line, here are a few options I can think of for processing the image file as a whole.

  1. You can implement a custom input format and a record reader. The isSplitable() method in the input format should return false. The RecordReader.next( LongWritable pos, RecType val ) method should read the entire file and set val to the file contents. This will ensure that the entire file goes to one map task as a single record. (A minimal sketch follows this list.)

  2. You can sub-class an input format and override the isSplitable() method so that it returns false. This example shows how to create a sub-class of SequenceFileInputFormat to implement a NonSplittableSequenceFileInputFormat.
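
Here is a minimal, untested sketch of option 1 against the newer org.apache.hadoop.mapreduce API (the old-API RecordReader.next(pos, val) approach described above works the same way in spirit). The class names are mine, and the whole image is handed to the mapper as a single BytesWritable:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // never split: the whole file goes to one map task
        }

        @Override
        public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit split,
                TaskAttemptContext context) {
            return new WholeFileRecordReader();
        }

        public static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
            private FileSplit fileSplit;
            private Configuration conf;
            private final BytesWritable value = new BytesWritable();
            private boolean processed = false;

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context) {
                this.fileSplit = (FileSplit) split;
                this.conf = context.getConfiguration();
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (processed) {
                    return false;
                }
                // Read the entire file into one value; fine for a ~200MB image,
                // but the whole file must fit in the mapper's heap.
                byte[] contents = new byte[(int) fileSplit.getLength()];
                Path file = fileSplit.getPath();
                FileSystem fs = file.getFileSystem(conf);
                FSDataInputStream in = null;
                try {
                    in = fs.open(file);
                    IOUtils.readFully(in, contents, 0, contents.length);
                    value.set(contents, 0, contents.length);
                } finally {
                    IOUtils.closeStream(in);
                }
                processed = true;
                return true;
            }

            @Override
            public NullWritable getCurrentKey() { return NullWritable.get(); }

            @Override
            public BytesWritable getCurrentValue() { return value; }

            @Override
            public float getProgress() { return processed ? 1.0f : 0.0f; }

            @Override
            public void close() { }
        }
    }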

Although you can use WholeFileInputFormat, SequenceFileInputFormat, or something custom to read the image file, the actual issue (in my view) is what to do with the data once it is read. OK, you have read the file, now what? How are you going to process the image to detect any objects inside your mapper? I'm not saying it's impossible, but it would require a lot of work.

IMHO, you are better off using something like HIPI. HIPI provides an API for performing image processing tasks on top of the MapReduce framework.

Edit:

If you really want to do it your way, then you need to write a custom InputFormat. Since images are not like text files, you can't use delimiters like \n for split creation. One possible workaround is to create splits based on a given number of bytes. For example, if your image file is 200 MB, you could write an InputFormat that creates splits of 100 MB (or whatever you pass as a parameter in your job configuration); a rough sketch of such a getSplits() follows. I faced such a scenario long ago while dealing with some binary files, and this project helped me a lot.
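
If it helps, here is a rough, untested sketch of the getSplits() side of such an InputFormat (new mapreduce API). The class name and the configuration key image.split.bytes are made up for illustration, and the matching record reader that turns each byte range into one record is left out:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class FixedSizeSplitInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
            // Split length comes from the job configuration; default to 100 MB.
            long chunk = job.getConfiguration().getLong("image.split.bytes", 100L * 1024 * 1024);
            List<InputSplit> splits = new ArrayList<InputSplit>();
            for (FileStatus status : listStatus(job)) {
                Path path = status.getPath();
                long length = status.getLen();
                FileSystem fs = path.getFileSystem(job.getConfiguration());
                BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, length);
                long offset = 0;
                while (offset < length) {
                    long size = Math.min(chunk, length - offset);
                    // Prefer hosts that store the first block of this byte range.
                    int blockIndex = getBlockIndex(blocks, offset);
                    splits.add(new FileSplit(path, offset, size, blocks[blockIndex].getHosts()));
                    offset += size;
                }
            }
            return splits;
        }

        @Override
        public RecordReader<LongWritable, BytesWritable> createRecordReader(InputSplit split,
                TaskAttemptContext context) {
            // A record reader that reads the split's byte range as a single record
            // would go here; omitted to keep the sketch short.
            throw new UnsupportedOperationException("record reader not shown in this sketch");
        }
    }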

HTH

I guess it depends on what type of processing you want to perform. If what you are trying to do can be done by first splitting the big input into smaller image files, processing those pieces independently, and finally stitching the output parts back into one large final output, then it may be possible. I'm no image expert, but suppose you want to turn a color image into a grayscale one: you could cut the large image into small images, convert them in parallel using MR, and once the mappers are done, stitch them back into one large grayscale image (a rough mapper sketch follows).
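
For what it's worth, that grayscale step could look roughly like the mapper below, assuming each tile arrives as one record (for example from a whole-file style input format like the one sketched earlier). The class name and the PNG output choice are just for illustration; stitching would happen in a reducer or a follow-up job:

    import java.awt.Graphics2D;
    import java.awt.image.BufferedImage;
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import javax.imageio.ImageIO;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class GrayscaleTileMapper
            extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {

        @Override
        protected void map(NullWritable key, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            // Decode the tile from the record's raw bytes.
            BufferedImage color = ImageIO.read(
                    new ByteArrayInputStream(value.getBytes(), 0, value.getLength()));
            if (color == null) {
                return; // skip tiles that cannot be decoded
            }

            // Repaint the tile into a single-channel grayscale image.
            BufferedImage gray = new BufferedImage(
                    color.getWidth(), color.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
            Graphics2D g = gray.createGraphics();
            g.drawImage(color, 0, 0, null);
            g.dispose();

            // Re-encode and emit, keyed by the tile's file name so a later step
            // can stitch the tiles back into one large grayscale image.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ImageIO.write(gray, "png", out);
            String tileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            context.write(new Text(tileName), new BytesWritable(out.toByteArray()));
        }
    }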

If you understand the format of the image, you can write your own RecordReader to help the framework understand the record boundaries, so the records are not corrupted when they are fed to the mappers.
