
How to improve performance in analysing log files using MapReduce

We have to analyze log files using Hadoop, since it can handle large data easily. So I wrote a MapReduce program, but even that program is taking a lot of time to extract the data.

// Inside the mapper's map() method: split each log line on spaces and,
// when the line has the expected shape, emit the selected fields.
String keys[] = value.toString().split(" ");
int keysLength = keys.length;
if (keysLength > 4 && StringUtils.isNumeric(keys[keysLength - 5])) {
    this.keyWords.set(keys[0] + "-" + keys[1] + " " + keys[2] + " "
            + keys[keysLength - 5] + " " + keys[keysLength - 2]);
    context.write(new IntWritable(1), keyWords);
}

The requirement is: we will have at most 10 to 15 .gz files, and every .gz file has one log file inside. We have to pull the data from those log files to analyze it.

Sample input in the log file:

2015-09-12 03:39:45.201 [service_client] [anhgv-63ac7ca63ac] [[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'] INFO TempServerImplementation - || Server: loclhost 121.1.0.0 | Service Category: Testing | Service Method: add | Application Id: Test | Status Code: 200 | Duration: 594ms ||

So could someone help me with how I can tune the performance?

Thanks, Sai

You can try using Spark (you can think of it as in-memory MapReduce); it is 10x to 100x faster than traditional MapReduce. Please check the trade-offs between Hadoop MapReduce and Spark before using it.
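For reference, here is a minimal sketch of the same split-and-filter logic expressed with the Spark Java API; the paths, app name, and class name are illustrative, not from the question:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LogAnalysis {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("log-analysis");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // textFile transparently decompresses .gz input,
            // but still creates only one partition per gz file
            JavaRDD<String> lines = sc.textFile("hdfs:///logs/*.gz");
            JavaRDD<String> parsed = lines
                .map(line -> line.split(" "))
                .filter(k -> k.length > 4 && k[k.length - 5].matches("\\d+"))
                .map(k -> k[0] + "-" + k[1] + " " + k[2] + " "
                        + k[k.length - 5] + " " + k[k.length - 2]);
            parsed.saveAsTextFile("hdfs:///logs/out");
        }
    }
}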

There are two main ways you can speed up your job: input size and variable initialisation.

Input Size

gz is not a splittable format. That means that if you have 15 input gz files, you will only have 15 mappers. I can see from the comments that each gz file is 50MB, so at a generous 10:1 compression ratio, each mapper would be processing 500MB. This can take time, and unless you've got a cluster of fewer than 15 nodes, you'll have nodes that are doing nothing. By uncompressing the data before the MR job you could have more mappers, which would reduce the runtime. A small driver program can do the decompression, as sketched below.
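A minimal sketch of that decompression step, assuming the .gz files already sit on HDFS (the paths and class name are illustrative), using Hadoop's codec factory to rewrite each file uncompressed so the job can create more than one split per file:

import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class GunzipToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        Path inDir = new Path(args[0]);   // directory holding the .gz files
        Path outDir = new Path(args[1]);  // destination for uncompressed copies

        for (FileStatus status : fs.listStatus(inDir)) {
            Path in = status.getPath();
            CompressionCodec codec = factory.getCodec(in);
            if (codec == null) continue;  // skip files that aren't compressed
            // Strip the .gz extension for the output file name
            String name = CompressionCodecFactory.removeSuffix(
                    in.getName(), codec.getDefaultExtension());
            try (InputStream is = codec.createInputStream(fs.open(in));
                 OutputStream os = fs.create(new Path(outDir, name))) {
                IOUtils.copyBytes(is, os, 4096, false);
            }
        }
    }
}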

Variable Initialisation

In the line below:

context.write(new IntWritable(1), keyWords);

you're generating a big overhead by allocating a brand new IntWritable for each output. Instead, why not allocate it once at the top of the class? It doesn't change, so it doesn't need allocating each time.

For example:

private static final IntWritable ONE_WRITABLE = new IntWritable(1);
...
context.write(ONE_WRITABLE, keyWords);

The same applies to the strings you use, " " and "-". Assign them as static variables as well, and again avoid creating fresh ones each time.
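Putting both suggestions together, a minimal sketch of the reworked mapper could look like this (the class name is illustrative, and StringUtils is assumed to be Commons Lang's, as in your snippet):

import java.io.IOException;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private static final IntWritable ONE_WRITABLE = new IntWritable(1);
    private static final String SPACE = " ";
    private static final String DASH = "-";
    private final Text keyWords = new Text();  // reused across map() calls

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] keys = value.toString().split(SPACE);
        int len = keys.length;
        if (len > 4 && StringUtils.isNumeric(keys[len - 5])) {
            keyWords.set(keys[0] + DASH + keys[1] + SPACE + keys[2] + SPACE
                    + keys[len - 5] + SPACE + keys[len - 2]);
            context.write(ONE_WRITABLE, keyWords);
        }
    }
}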
