
How to run mrjob library python map reduce in ubuntu standalone local hadoop cluster

I went through the documentation and it says it is meant for AWS and GCP. But they are also using it internally somehow, right? So there should be a way to make it run in our own locally created Hadoop cluster in our own VirtualBox.

Some code showing how mrjob is used:

from mrjob.job import MRJob

class MovieSimilar(MRJob):
    def mapper_parse_input(self, key, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield userID, (movieID, float(rating))
    ..........
    ..........

if __name__ == '__main__':
    MovieSimilar.run()

With the hadoop-streaming jar and plain Python scripts I am able to run the code. But mrjob isn't accepting the dataset location from the command line and gives "more than 2 values required to unpack". That error is because it is unable to take the dataset given with the -input flag.

The shell command I am using:

bin/hadoop jar /usr/local/Cellar/hadoop/3.1.0/libexec/share/hadoop/tools/lib/hadoop-streaming.jar \
-file /<path_to_mapper>/MovieSimilar.py \
-mapper /<path_to_mapper>/MovieSimilar.py \
-reducer /<path_to_reducer>/MovieSimilar.py \
-input daily/<dataset-file>.csv \
-output daily/output

Note: daily is my HDFS directory, where datasets and program output are stored.

Error message I am receiving: more than 2 values required to unpack

says it is meant for aws, gcp

Those are examples. mrjob is not meant only for those. Notice the -r local and -r hadoop flags for running a job:

https://mrjob.readthedocs.io/en/latest/guides/runners.html#running-on-your-own-hadoop-cluster

there should be a way to make it run in our own locally created hadoop cluster in our own virtual box

Set up your HADOOP_HOME, and point your HADOOP_CONF_DIR XML files at the cluster you want to run the code against. Then, using the -r hadoop runner flag, mrjob will find and run your code using the hadoop binary and the hadoop-streaming jar file.
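For example, a minimal sketch of that setup (the install path here is an assumption; adjust it to wherever Hadoop lives in your VirtualBox image):

```shell
# Point mrjob at the local cluster's binaries and config.
export HADOOP_HOME=/usr/local/hadoop            # assumed install location
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop  # dir holding core-site.xml, yarn-site.xml, etc.

# With -r hadoop, mrjob locates the hadoop binary and the
# streaming jar itself -- no manual "hadoop jar ..." needed.
python MovieSimilar.py -r hadoop hdfs:///daily/<dataset-file>.csv
```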

more than 2 values required to unpack ... that error is because it is unable to take the dataset given with the -input flag

I can't see your input, but this line would cause that error if any input line contained fewer than three tabs (and you don't need the parentheses to the left of the equals sign):

(userID, movieID, rating, timestamp) = line.split('\t')
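You can reproduce the failure without Hadoop at all. A minimal sketch, with made-up sample lines in the userID/movieID/rating/timestamp layout the mapper expects:

```python
# Four-way unpacking only works when the line has exactly three tabs.
good = "196\t242\t3\t881250949"     # 4 tab-separated fields
bad = "196,242,3,881250949"         # comma-separated: split('\t') yields 1 field

userID, movieID, rating, timestamp = good.split('\t')  # fine

try:
    userID, movieID, rating, timestamp = bad.split('\t')
except ValueError as exc:
    print(exc)  # not enough values to unpack (expected 4, got 1)

def parse(line):
    """Defensive variant: skip malformed rows instead of crashing the job."""
    fields = line.split('\t')
    if len(fields) != 4:
        return None
    userID, movieID, rating, timestamp = fields
    return userID, (movieID, float(rating))
```

A `yield`-based mapper can simply `return` (yield nothing) when `parse` gives `None`, so one bad row no longer kills the whole task.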

I suggest testing your code using the local/inline runner first.

The shell command I am using:

bin/hadoop jar /usr/local/Cellar/hadoop/3.1.0/libexec/share/hadoop/tools/lib/hadoop-streaming.jar

mrjob will build and submit that for you.

You only need to run python MovieSimilar.py with your input files.
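So the whole streaming command collapses to something like this (filenames are illustrative; `-r inline` and `--output-dir` are documented mrjob options):

```shell
# Debug in-process first -- no Hadoop involved at all:
python MovieSimilar.py -r inline u.data > out_local.txt

# Then submit to the cluster; mrjob wraps hadoop-streaming for you:
python MovieSimilar.py -r hadoop hdfs:///daily/<dataset-file>.csv \
    --output-dir hdfs:///daily/output
```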
