
How do wordCount MapReduce jobs run on a Hadoop YARN cluster with Apache Tez?

As the GitHub page of Tez says, Tez is very simple and at its heart has just two components:

  1. The data-processing pipeline engine, and

  2. A master for the data-processing application, whereby one can put together the arbitrary data-processing 'tasks' described above into a task-DAG.

Well, my first question is: how are existing MapReduce jobs, like the wordcount in tez-examples.jar, converted to a task-DAG? Where does that happen? Or are they not converted at all?

And my second, more important question is about this part:

Every 'task' in Tez has the following:

  1. Input to consume key/value pairs from.
  2. Processor to process them.
  3. Output to collect the processed key/value pairs.

Who is in charge of splitting the input data between the Tez tasks? Is it the code that the user provides, YARN (the resource manager), or Tez itself?

The same question applies to the output phase. Thanks in advance.

To answer your first question on converting MapReduce jobs to Tez DAGs:

Any MapReduce job can be thought of as a single DAG with 2 vertices (stages). The first vertex is the Map stage, and it is connected to the downstream Reduce vertex via a shuffle edge.

There are 2 ways in which MR jobs can be run on Tez:

  1. One approach is to write a native 2-stage DAG using the Tez APIs directly. This is what is currently present in tez-examples (a rough sketch of such a DAG follows this list).
  2. The second is to use the MapReduce APIs themselves and run in yarn-tez mode. In this scenario, there is a layer which intercepts the MR job submission and, instead of using MR, translates the MR job into a 2-stage Tez DAG and executes the DAG on the Tez runtime.
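
Here is a rough sketch of what the first approach looks like with the Tez Java API, modelled loosely on the WordCount in tez-examples. The com.example.* processor class names are placeholders for the user-written map-side and reduce-side processors, and the exact builder/method names may vary between Tez versions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.mapreduce.input.MRInput;
import org.apache.tez.mapreduce.output.MROutput;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;
import org.apache.tez.runtime.library.partitioner.HashPartitioner;

public class WordCountDagSketch {

  static DAG buildWordCountDag(Configuration conf, String in, String out, int reducers)
      throws Exception {
    // "Map" vertex: a user-written processor (like TokenProcessor in tez-examples,
    // here just a placeholder class name) fed by an MRInput data source. The
    // InputFormat supplied here defines how the input is read and split.
    Vertex tokenizer = Vertex.create("Tokenizer",
        ProcessorDescriptor.create("com.example.TokenProcessor"));
    tokenizer.addDataSource("Input",
        MRInput.createConfigBuilder(conf, TextInputFormat.class, in).build());

    // "Reduce" vertex: another user-written processor with a fixed task count.
    Vertex summer = Vertex.create("Summer",
        ProcessorDescriptor.create("com.example.SumProcessor"), reducers);
    summer.addDataSink("Output",
        MROutput.createConfigBuilder(conf, TextOutputFormat.class, out).build());

    // Shuffle edge: sorted, partitioned key/value movement between the two stages.
    OrderedPartitionedKVEdgeConfig shuffle = OrderedPartitionedKVEdgeConfig
        .newBuilder(Text.class.getName(), IntWritable.class.getName(),
            HashPartitioner.class.getName())
        .build();

    return DAG.create("WordCount")
        .addVertex(tokenizer)
        .addVertex(summer)
        .addEdge(Edge.create(tokenizer, summer, shuffle.createDefaultEdgeProperty()));
  }
}
```

With the second approach nothing like this is written by hand: as far as I know, setting mapreduce.framework.name to yarn-tez in the client's mapred-site.xml is what activates the interception layer, and the unchanged MR job is then translated into the equivalent 2-stage DAG.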

For the data handling related questions that you have:

The user provides the logic for understanding the data to be read and how to split it. Tez then takes each data split and takes over the responsibility of assigning a split, or a set of splits, to a given task.
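
As a small, hedged illustration of that division of labour when the MRInput adapter from the Tez MapReduce compatibility layer is used (the input path is a placeholder, and the builder methods should be treated as approximate):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.tez.dag.api.DataSourceDescriptor;
import org.apache.tez.mapreduce.input.MRInput;

public class SplitSketch {
  // "/data/in" is a placeholder input path.
  static DataSourceDescriptor textSource(Configuration conf) {
    // The InputFormat (here TextInputFormat, but it can be any user-supplied one)
    // carries the "how to read and how to split" logic; Tez then decides which
    // task gets which split. groupSplits(true) lets Tez pack several small
    // splits into a single task instead of running one task per split.
    return MRInput.createConfigBuilder(conf, TextInputFormat.class, "/data/in")
        .groupSplits(true)
        .build();
  }
}
```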

The Tez framework then controls the generation and movement of data, i.e. where to generate the data between intermediate steps and how to move data between 2 vertices/stages. However, it does not control the underlying data contents/structure, partitioning or serialization logic, which is provided by user plugins.
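
To make the "provided by user plugins" part concrete, here is a sketch of a custom partitioner for the shuffle edge. The Partitioner interface is the one from the Tez runtime library; the class itself is made up for illustration and would be passed by class name when building the edge configuration (for example in place of HashPartitioner in the earlier DAG sketch):

```java
import org.apache.tez.runtime.library.api.Partitioner;

// A hypothetical user-supplied partitioner. Tez moves the shuffled data between
// the two vertices, but which partition (and hence which downstream task) a
// record lands in is decided by plugin code like this.
public class FirstLetterPartitioner implements Partitioner {
  @Override
  public int getPartition(Object key, Object value, int numPartitions) {
    String word = key == null ? "" : key.toString();
    int bucket = word.isEmpty() ? 0 : Character.toLowerCase(word.charAt(0));
    return bucket % numPartitions;  // toy bucketing by first character
  }
}
```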

The above is just a high-level view; there are additional intricacies. You will get more detailed answers by posting specific questions to the development list ( http://tez.apache.org/mail-lists.html ).
