
Hadoop streaming accessing files in a directory

I wish to access a directory in Hadoop (via Python streaming) and loop through its image files, calculating a hash of each in my mapper. Does the following logic make sense (and, instead of hard-coding the path, can I pass the directory to Hadoop, e.g. as -input)?

import glob
import pHash  # pHash Python bindings, assumed available on the mapper nodes

# Directory of images to hash (hard-coded here; could this come in via -input?)
lotsdir = 'hdfs://localhost:54310/user/hduser/randomimages/'
path = lotsdir + '*.*'
files = glob.glob(path)
files.sort()

imagehashes = {}
for fname in files:
    imagehashes[fname] = pHash.imagehash(fname)
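
The kind of streaming invocation I have in mind is sketched below (the jar location, output path, and mapper script name are just placeholders):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/hduser/randomimages \
    -output /user/hduser/imagehash-out \
    -mapper hash_mapper.py \
    -file hash_mapper.py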

Yes, the logic makes sense.

But you will very likely run into a performance issue, since your input files are not in text format and so will not be split properly on HDFS.

Fortunately, Hadoop provides several ways to work around that issue.

You could also try printing out the image file contents as an encoded character string: something like [[1, 2, 3], [4, 5, 6]] becomes 1:2:3:4:5:6 on stdin. Your mapper can then read from stdin and decode it back into a numpy array (since you know the image dimensions, this takes only a few lines of number extraction and an ndarray reshape). That array is effectively your image. I'm working on a similar project and have run into the same issues. Hope it works for you.
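
A minimal sketch of that encode/decode round trip, assuming numpy is available in the streaming environment; the 2x3 dimensions and the helper names are illustrative only:

import sys
import numpy as np

# Image dimensions known to both the encoder and the mapper (illustrative values).
ROWS, COLS = 2, 3

def encode(image):
    # Flatten e.g. [[1, 2, 3], [4, 5, 6]] into the string '1:2:3:4:5:6'.
    return ':'.join(str(v) for row in image for v in row)

def decode(line):
    # Rebuild the numpy array from one colon-separated line read off stdin.
    values = [int(v) for v in line.strip().split(':')]
    return np.array(values).reshape(ROWS, COLS)

if __name__ == '__main__':
    # Mapper side: each stdin line is one encoded image.
    for line in sys.stdin:
        if line.strip():
            image = decode(line)
            # ... compute and emit the hash of `image` here ...

encode() would be run by whatever step prepares the text input for the job; the mapper itself only needs decode().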
