
How does HBase mapreduce job communicate with server? (newbie question)

I am new to Hadoop and HBase, and even though I've read a lot, I still don't understand the basic hierarchy and workflow of the MapReduce job API.

From what I understand, I will need to use the Java API to implement certain classes and pass them to HBase, which will coordinate the splitting and distribution process. Is that correct?

If so, how does the application communicate with the server to pass the relevant code for the MapReduce job? I have a missing link here...

Thanks

When you run your HBase MapReduce job, your classpath has to contain both the HBase and MapReduce configuration files. The configuration files contain settings such as the location of the JobTracker, the HDFS NameNode, and the HBase master node. The runtime then automatically picks up all these settings from the configuration files, so your job knows which servers to contact.
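As a minimal sketch of what "picking up the configuration" looks like in code: `HBaseConfiguration.create()` reads `hbase-site.xml` (and the Hadoop `*-site.xml` files) from the classpath, which is why those files must be on it. The class name here is illustrative; the printed value will be null unless a real `hbase-site.xml` is present.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ConfigExample {
    public static void main(String[] args) {
        // Loads hbase-site.xml plus the Hadoop configuration files found on
        // the classpath; this is how the job learns where the cluster lives.
        Configuration conf = HBaseConfiguration.create();

        // For example, the ZooKeeper quorum HBase clients connect through:
        System.out.println(conf.get("hbase.zookeeper.quorum"));
    }
}
```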

I think you should just work through the basic tutorial, which should make things clear. I found the quickest way to get started was by playing with the Cloudera VM.

Also, I'm not sure about your reference to HBase; you should be passing Java classes to Hadoop, not HBase.

However, in an attempt to answer your question: Hadoop should be installed on all nodes in your cluster. The Hadoop framework will take care of farming the map and reduce tasks out to the nodes.

The standard way to execute an M/R job that uses HBase is the same way you execute a non-HBase M/R job: ${HADOOP_HOME}/bin/hadoop jar <your-job>.jar [args]

This copies your jar to all of the task trackers (via HDFS) so that they can execute your code.

With HBase you will also typically use the HBase utility TableMapReduceUtil.initTableReducerJob (and its counterpart initTableMapperJob).
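A sketch of a job driver using that utility might look like the following. The table names, job name, and mapper are placeholders, and this assumes the classic (pre-YARN) HBase MapReduce API that matches the JobTracker-era setup described above; it needs a live cluster to actually run.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class MyDriver {

    // Trivial pass-through mapper: emits each scanned row unchanged.
    static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context ctx)
                throws java.io.IOException, InterruptedException {
            ctx.write(row, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "hbase-mr-example"); // job name is illustrative
        job.setJarByClass(MyDriver.class);           // ships this jar to the cluster

        Scan scan = new Scan();
        scan.setCaching(500);       // rows fetched per RPC; tune for your data
        scan.setCacheBlocks(false); // recommended off for full-table MR scans

        // Splits the scan over the regions of "source_table" (placeholder name)
        TableMapReduceUtil.initTableMapperJob(
                "source_table", scan, MyMapper.class,
                ImmutableBytesWritable.class, Result.class, job);

        // Writes the mapper output back to "target_table" (placeholder name)
        TableMapReduceUtil.initTableReducerJob(
                "target_table", IdentityTableReducer.class, job);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```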

This uses built-in algorithms to split an HBase table (using the regions of the table) so that computation can be distributed over the m/r tasks. If you want a different split, you have to modify the way splits are calculated, which means that you cannot use the built-in utility.

The other thing you can specify is conditions on the rows that are returned. If you use a built-in scan condition, then you don't have to do anything special. However, if you want to create a custom comparator, you have to make sure that the region servers have this code in their classpath so that they can execute it. Before you go this route, examine the built-in comparators carefully, as they are quite powerful.
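For instance, a "built-in scan condition" can be a filter using one of the stock comparators, which requires no extra code on the region servers. The column family, qualifier, and value below are hypothetical; this fragment would be set on the Scan passed to initTableMapperJob.

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilterExample {
    static Scan buildScan() {
        Scan scan = new Scan();
        // Only return rows where cf:status equals "active". BinaryComparator
        // ships with HBase, so the region servers can evaluate it as-is.
        scan.setFilter(new SingleColumnValueFilter(
                Bytes.toBytes("cf"), Bytes.toBytes("status"),
                CompareFilter.CompareOp.EQUAL,
                new BinaryComparator(Bytes.toBytes("active"))));
        return scan;
    }
}
```

A custom comparator, by contrast, would be a class of your own in that last argument's position, and it would have to be deployed to every region server's classpath.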
