How to run Mapreduce from within a pig script

I want to understand how to integrate calling a mapreduce job from within a pig script.

I referred to the link https://wiki.apache.org/pig/NativeMapReduce

But I am not sure how to do it, as I don't see how Pig will know which is my mapper or my reducer code. The explanation is not very clear.

If someone can illustrate it with an example, it would be of great help.

Thanks in advance, cheers :)

Example from the Pig documentation:

A = LOAD 'WordcountInput.txt';
B = MAPREDUCE 'wordcount.jar' STORE A INTO 'inputDir' LOAD 'outputDir' 
    AS (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`;

In the above example, Pig will store the input data from A into inputDir and load the job's output data from outputDir.

Also, there is a jar in HDFS called wordcount.jar, in which there is a class org.myorg.WordCount with a main method that takes care of setting up the mappers and reducers, the input and output, etc.
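
The wiki page does not show the Java side, but a minimal sketch of such a driver class, modeled on the classic WordCount from the Hadoop MapReduce tutorial, could look like the following. The package and class names mirror the Pig example above; the rest is an assumption about how the jar was built, not something the original answer specifies:

// Hypothetical driver matching the Pig example above; based on the
// standard Hadoop WordCount tutorial code (new mapreduce API).
package org.myorg;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in each input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Sums the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // main() wires up the job. When invoked from Pig's MAPREDUCE operator,
  // args[0] and args[1] are the inputDir and outputDir passed in backticks.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}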

You could also call the mapreduce job on its own via hadoop jar wordcount.jar org.myorg.WordCount inputDir outputDir.

By default Pig expects you to supply the map/reduce program. However, Hadoop comes with default mapper/reducer implementations, which Pig falls back on when no map/reduce class is identified.

Furthermore, Pig uses the properties from Hadoop along with its own specific properties. Try setting the properties below in the Pig script; they should be picked up by Pig as well.

SET mapred.mapper.class="<fully qualified classname for mapper>"
SET mapred.reducer.class="<fully qualified classname for reducer>"

The same can be set using the -Dmapred.mapper.class option as well. A comprehensive list is here. Depending on your Hadoop installation, the properties could also be:

mapreduce.map.class
mapreduce.reduce.class
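
For example, a hypothetical command-line invocation might look like the following (the script name and the mapper/reducer class names are placeholders taken from the WordCount sketch above, not something the original answer specifies):

pig -Dmapred.mapper.class='org.myorg.WordCount$TokenizerMapper' \
    -Dmapred.reducer.class='org.myorg.WordCount$IntSumReducer' \
    myscript.pig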

Just FYI...

hadoop.mapred has been deprecated: versions before 0.20.1 used mapred, and versions after that use mapreduce.

Moreover, Pig has its own set of properties, which can be viewed using the command pig -help properties.

e.g. in my pig installation, below are the properties:

The following properties are supported:
    Logging:
        verbose=true|false; default is false. This property is the same as -v switch
        brief=true|false; default is false. This property is the same as -b switch
        debug=OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as -d switch
        aggregate.warning=true|false; default is true. If true, prints count of warnings
            of each type rather than logging each warning.
    Performance tuning:
        pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory).
            Note that this memory is shared across all large bags used by the application.
        pig.skewedjoin.reduce.memusage=<mem fraction>; default is 0.3 (30% of all memory).
            Specifies the fraction of heap available for the reducer to perform the join.
        pig.exec.nocombiner=true|false; default is false.
            Only disable combiner as a temporary workaround for problems.
        opt.multiquery=true|false; multiquery is on by default.
            Only disable multiquery as a temporary workaround for problems.
        opt.fetch=true|false; fetch is on by default.
            Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.
        pig.tmpfilecompression=true|false; compression is off by default.
            Determines whether output of intermediate jobs is compressed.
        pig.tmpfilecompression.codec=lzo|gzip; default is gzip.
            Used in conjunction with pig.tmpfilecompression. Defines compression type.
        pig.noSplitCombination=true|false. Split combination is on by default.
            Determines if multiple small files are combined into a single map.
        pig.exec.mapPartAgg=true|false. Default is false.
            Determines if partial aggregation is done within map phase,
            before records are sent to combiner.
        pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10.
            If the in-map partial aggregation does not reduce the output num records
            by this factor, it gets disabled.
    Miscellaneous:
        exectype=mapreduce|local; default is mapreduce. This property is the same as -x switch
        pig.additional.jars.uris=<comma separated list of jars>. Used in place of register command.
        udf.import.list=<comma separated list of imports>. Used to avoid package names in UDF.
        stop.on.failure=true|false; default is false. Set to true to terminate on the first error.
        pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host.
            Determines the timezone used to handle datetime datatype and UDFs. Additionally, any Hadoop property can be specified.
