
Hadoop: how to get distributed cache files in the mapper when using the -files command line option

I run Hadoop MapReduce jobs from a remote machine (Windows) using the command

java -jar XMLDriver.jar -files junkwords.txt -libjars XMLInputFormat.jar

and submit the job to a Linux box which runs Hadoop.

I know that this distributed cache file will be sent to HDFS on my remote box (am I right?).

But in the mapper code I am unable to retrieve the file name using the API:

// retrieve the local paths of the cached files
Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
String fileName = cacheFiles[0].toString();

Should I use the DistributedCache.addCacheFile() API and symlinks? If so, what URI parameter do I need to pass, since I don't know where Hadoop will copy the files on the Linux box?

Also, I tried copying the junkwords.txt file to HDFS manually and specifying the HDFS path on the command line:

java -jar XMLDriver.jar -files /users/junkwords.txt -libjars XMLInputFormat.jar

This throws a FileNotFoundException when I run the job from my local Windows machine.

What is the correct way to access a distributed cache file in the mapper when it is passed from a remote machine using the -files command line option?

DistributedCache.addCacheFile()

Indeed, you should add the file to the distributed cache when you set up the job.
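
A minimal sketch of what that might look like, assuming the Hadoop 1.x org.apache.hadoop.filecache.DistributedCache API; the /users/junkwords.txt HDFS path, the class bodies, and the setup()-based retrieval are illustrative, and the file must already exist in HDFS before the job is submitted:

// Driver side: register the file while configuring the job.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class XMLDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The URI is the file's location in HDFS, not a path on the
        // submitting Windows machine.
        DistributedCache.addCacheFile(new URI("/users/junkwords.txt"), conf);

        Job job = new Job(conf, "xml job");
        // ... set jar, mapper class, input/output formats and paths ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

// Mapper side: read the local copy of the cached file in setup().
import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class XMLMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String fileName;

    @Override
    protected void setup(Context context) throws IOException {
        // Hadoop has already copied the cached file to the task node's
        // local disk; getLocalCacheFiles() returns those local paths.
        Path[] cacheFiles =
                DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cacheFiles != null && cacheFiles.length > 0) {
            fileName = cacheFiles[0].toString();
        }
    }
}

With the file registered this way, the -files option is no longer needed: the task nodes pull the file from HDFS to their local disk, and getLocalCacheFiles() returns those local paths.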
