Why don't my tasks run in parallel in Pig?

I'm learning Hadoop and experimenting on a project that could go into production as a big data project. For now, though, I'm only testing with a small amount of data. The scenario is as follows: there are a bunch of JSON files that I load in Pig as below:

a = load 's3n://mybucket/user_*.json' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map []);
b = FOREACH a GENERATE flatten(json#'user') as (m:map[]) ;

Let's say the files are small: each contains just one object, but there are a bunch of them. I assumed the FOREACH would run in parallel, opening several files at once. Am I wrong? The program takes about 10 seconds to run on an Amazon c3.xlarge instance, and there are about 400 files. I'm sure that a C# program would run in a fraction of a second. Where am I going wrong?

Pig does run tasks in parallel, but it spends some time up front because it optimizes the whole script and runs it as MapReduce jobs, so operating on a small dataset will be slow in Pig. It is meant for large datasets. To increase the number of parallel tasks, you can add the PARALLEL clause to an operator, or raise the number of reducers for the whole script with set default_parallel n, which sets the parallelism to n. Note that PARALLEL only affects operators that trigger a reduce phase (such as GROUP or JOIN), so it does not change how a map-only FOREACH runs. The last possibility is that Pig is running all the work in mappers and, because your files are small, the number of mappers is too low; in that case you have to change the job configuration to increase the number of mappers.
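As a rough sketch of the settings mentioned above, the script from the question could be extended as follows. The values 10, 4 and 65536 are arbitrary, and the GROUP step with its 'id' key is hypothetical, added only so that there is a reduce phase for PARALLEL to act on. The pig.splitCombination and pig.maxCombinedSplitSize properties are Pig's settings for combining small input files into one map task; whether they change the mapper count with this particular loader is an assumption that would need to be checked on the cluster.

```pig
-- Default number of reducers for every reduce-side operator in the script.
set default_parallel 10;

-- Assumption: Pig is combining the ~400 small files into very few map tasks.
-- Lowering the combined split size (in bytes) may yield more mappers;
-- "set pig.splitCombination false;" would turn combining off entirely.
set pig.maxCombinedSplitSize 65536;

a = LOAD 's3n://mybucket/user_*.json'
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
    AS (json:map[]);
b = FOREACH a GENERATE FLATTEN(json#'user') AS (m:map[]);

-- Hypothetical reduce-side step: PARALLEL overrides default_parallel
-- for this operator only. It has no effect on the map-only FOREACH above.
c = GROUP b BY m#'id' PARALLEL 4;
```

For a job this small, more mappers will not necessarily make it faster, since each extra task adds its own startup cost on top of the fixed job launch overhead.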

