MapReduce with C#: Process whole input files
Problem:

I'm creating a MapReduce application in C# for HDInsight. I need to process whole input files.
I understand there are two options available in Hadoop to achieve this: raising the minimum split size above the size of the largest file, or implementing a custom InputFormat whose isSplitable() returns false. I can't figure out how to achieve either of these options using C# on HDInsight.
Details:

I'm either using Microsoft.Hadoop.MapReduce, and starting the job via

hadoop.MapReduceJob.ExecuteJob<MyJob>();
Or by simply creating a console application and starting it from Azure PowerShell via:
$mrJobDef = New-AzureHDInsightStreamingMapReduceJobDefinition -JobName MyJob -StatusFolder $mrStatusOutput -Mapper $mrMapper -Reducer $mrReducer -InputPath $mrInput -OutputPath $mrOutput
$mrJobDef.Files.Add($mrMapperFile)
$mrJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $mrJobDef
A solution for either way would help a lot.
You can set min_splitsize using the -Defines parameter in PowerShell:
$clusterName = "YourClusterName"
$jobConfig = @{ "min_splitsize"="512mb"; "mapred.output.compression.codec"="org.apache.hadoop.io.compress.GzipCodec" }
$myWordCountJob = New-AzureHDInsightMapReduceJobDefinition -JarFile "/example/jars/hadoop-examples.jar" -ClassName "wordcount" -jobName "WordCountJob" -StatusFolder "/MyMRJobs/WordCountJobStatus" -Defines $jobConfig
or in C#:
var mapReduceJob = new MapReduceJobCreateParameters()
{
    ClassName = "wordcount",                        // required
    JobName = "MyWordCountJob",                     // optional
    JarFile = "/example/jars/hadoop-examples.jar",  // required; alternative syntax: wasb://hdijobs@azimasv2.blob.core.windows.net/example/jar/hadoop-examples.jar
    StatusFolder = "/AzimMRJobs/WordCountJobStatus" // optional, but good to use to know where logs are uploaded in Azure Storage
};
mapReduceJob.Defines.Add("min_splitsize", "512mb");
That said, I don't think this guarantees that each file will be read completely. To do that you may need the Java SDK, explained here: http://www.andrewsmoll.com/3-hacks-for-hadoop-and-hdinsight-clusters/

Resources: http://blogs.msdn.com/b/bigdatasupport/archive/2014/02/13/how-to-pass-hadoop-configuration-values-for-a-job-via-hdinsight-powershell-and-net-sdk.aspx
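If the streaming (console application) route from the question is acceptable, one workaround is worth sketching: a streaming mapper is just an executable reading stdin, and Hadoop streaming exposes the current split's source file to the mapper through an environment variable (mapreduce_map_input_file on newer Hadoop releases, map_input_file on older ones). Keying every output line by that path makes the shuffle deliver all lines of one file to the same reducer, which approximates whole-file processing as long as no file spans multiple splits. This is my own sketch, not from the original answer; the class name and tab-separated output format are arbitrary choices:

```csharp
using System;

public class WholeFileKeyMapper
{
    public static void Main()
    {
        // Hadoop streaming sets this to the path of the split's source file.
        // The older variable name is tried as a fallback; "unknown" covers
        // running the mapper outside of Hadoop (e.g. local testing).
        string file = Environment.GetEnvironmentVariable("mapreduce_map_input_file")
                      ?? Environment.GetEnvironmentVariable("map_input_file")
                      ?? "unknown";

        string line;
        while ((line = Console.In.ReadLine()) != null)
        {
            // Emit <file path> TAB <line>; the shuffle phase then groups
            // all lines of one file onto the same reducer.
            Console.Out.WriteLine(file + "\t" + line);
        }
    }
}
```

The reducer then sees each file's lines contiguously under one key and can reassemble the file before processing it. Note this only approximates option one above: files larger than the split size would still be fed to several mappers, all of which would emit the same key.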