
Hadoop streaming accessing files in a directory

I wish to access a directory in Hadoop (via Python streaming) and loop through its image files, calculating a hash of each in my mapper. Does the following logic make sense (and, instead of hard-coding the path, can I pass the directory to Hadoop, e.g. as -input)?

import glob
import pHash  # pHash Python bindings, assumed available on the mapper nodes

# Directory of images to hash (hard-coded here; could this come in via -input?)
lotsdir = 'hdfs://localhost:54310/user/hduser/randomimages/'
path = lotsdir + '*.*'
files = glob.glob(path)
files.sort()

imagehashes = {}
for fname in files:
    imagehashes[fname] = pHash.imagehash(fname)
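
The kind of streaming invocation I have in mind is sketched below (the jar location, output path, and mapper script name are just placeholders):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/hduser/randomimages \
    -output /user/hduser/imagehash-out \
    -mapper hash_mapper.py \
    -file hash_mapper.py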

Yes, the logic makes sense.

But you will very likely run into a performance issue, since your input files are not in text format and so will not be split properly on HDFS.

Fortunately, Hadoop provides several ways to work around that issue.

You could also try printing out the image file contents as an encoded character string: something like [[1, 2, 3], [4, 5, 6]] becomes 1:2:3:4:5:6 on stdin. Your mapper can then read from stdin and decode it back into a numpy array (since you know the image dimensions, this takes only a few lines of number extraction and an ndarray reshape). That array is effectively your image. I'm working on a similar project and have run into the same issues. Hope it works for you.
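
A minimal sketch of that encode/decode round trip, assuming numpy is available in the streaming environment; the 2x3 dimensions and the helper names are illustrative only:

import sys
import numpy as np

# Image dimensions known to both the encoder and the mapper (illustrative values).
ROWS, COLS = 2, 3

def encode(image):
    # Flatten e.g. [[1, 2, 3], [4, 5, 6]] into the string '1:2:3:4:5:6'.
    return ':'.join(str(v) for row in image for v in row)

def decode(line):
    # Rebuild the numpy array from one colon-separated line read off stdin.
    values = [int(v) for v in line.strip().split(':')]
    return np.array(values).reshape(ROWS, COLS)

if __name__ == '__main__':
    # Mapper side: each stdin line is one encoded image.
    for line in sys.stdin:
        if line.strip():
            image = decode(line)
            # ... compute and emit the hash of `image` here ...

encode() would be run by whatever step prepares the text input for the job; the mapper itself only needs decode().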
