
How to run an mrjob Python MapReduce job on a standalone local Hadoop cluster on Ubuntu

I went through the documentation and it says it is meant for AWS and GCP. But they must also be running it on their own infrastructure somehow, right? So there should be a way to make it run on our own locally created Hadoop cluster in our own VirtualBox.

Some code for understanding how mrjob is used:-

from mrjob.job import MRJob

class MovieSimilar(MRJob):
    def mapper_parse_input(self, key, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield userID, (movieID, float(rating))
        ..........
        ..........

if __name__ == '__main__':
    MovieSimilar.run()

With the Hadoop streaming jar and plain Python mapper/reducer scripts I am able to run jobs. But mrjob isn't accepting the dataset location from the command line and fails with "more than 2 values required to unpack". That error is because it is unable to take the dataset given by the -input flag.

The shell command I am using :-

bin/hadoop jar /usr/local/Cellar/hadoop/3.1.0/libexec/share/hadoop/tools/lib/hadoop-streaming.jar \
-file /<path_to_mapper>/MovieSimilar.py \
-mapper /<path_to_mapper>/MovieSimilar.py \
-reducer /<path_to_reducer>/MovieSimilar.py  \
-input daily/<dataset-file>.csv \
-output daily/output

Note:- daily is my HDFS directory where datasets and program results are stored.

Error message I am receiving :- more than 2 values required to unpack

says it is meant for AWS and GCP

Those are just examples; mrjob is not limited to those platforms. Notice the -r local and -r hadoop flags for running a job:

https://mrjob.readthedocs.io/en/latest/guides/runners.html#running-on-your-own-hadoop-cluster

there should be a way to make it run on our own locally created Hadoop cluster in our own VirtualBox

Set up HADOOP_HOME and the XML files in HADOOP_CONF_DIR to point at the cluster you want to run the code against. Then, with the -r hadoop runner flag, mrjob will find the hadoop binary and the hadoop-streaming jar and use them to run your code.
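For example, on a typical single-node Ubuntu install (a minimal sketch; the paths below are assumptions, adjust them to your own layout):

export HADOOP_HOME=/usr/local/hadoop            # directory containing bin/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop  # core-site.xml, hdfs-site.xml, yarn-site.xml

With those set, mrjob locates $HADOOP_HOME/bin/hadoop and the hadoop-streaming jar on its own when you pass -r hadoop.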

more than 2 values required to unpack ... that error is because it is unable to take the dataset given by the -input flag

I can't see your input, but this line would cause that error if any line had fewer than three tabs (and you don't need the parentheses to the left of the equals sign):

(userID, movieID, rating, timestamp) = line.split('\t')
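A minimal sketch of a more defensive version of that step (the class name, counter names, and single-step steps() are illustrative assumptions, not from the original code); it counts and skips malformed records instead of raising a ValueError, so the job counters tell you how many lines don't have exactly four tab-separated fields:

from mrjob.job import MRJob
from mrjob.step import MRStep

class MovieSimilarSafe(MRJob):

    def steps(self):
        # only the parse step is shown; the real job presumably has more steps
        return [MRStep(mapper=self.mapper_parse_input)]

    def mapper_parse_input(self, key, line):
        fields = line.split('\t')
        if len(fields) != 4:
            # count and skip bad records instead of crashing the task
            self.increment_counter('parse', 'bad_lines', 1)
            return
        userID, movieID, rating, timestamp = fields
        yield userID, (movieID, float(rating))

if __name__ == '__main__':
    MovieSimilarSafe.run()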

I suggest testing your code using the local/inline runner first
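For example (the sample file name is just a placeholder):

# inline runner: everything in one Python process, easiest to step through and debug
python MovieSimilar.py -r inline sample_ratings.tsv

# local runner: simulates a few more Hadoop features (multiple tasks), still no cluster needed
python MovieSimilar.py -r local sample_ratings.tsv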

The shell command I am using :-

bin/hadoop jar /usr/local/Cellar/hadoop/3.1.0/libexec/share/hadoop/tools/lib/hadoop-streaming.jar

mrjob will build and submit that streaming command for you.

You only need to run python MovieSimilar.py with your input files, for example:
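A sketch of a submission against your cluster (the HDFS paths are assumptions based on your daily directory; adjust the user name to your own):

python MovieSimilar.py -r hadoop \
    hdfs:///user/<your_user>/daily/<dataset-file>.csv \
    --output-dir hdfs:///user/<your_user>/daily/output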
