MapReduce Hadoop on Linux - Multiple data on input
I am using Ubuntu 20.10 with Hadoop version 3.2.1 on VirtualBox (please comment if you need more information). My output currently gives me this:
Aaron Wells Peirsol ,M,17,United States,Swimming,2000 Summer,0,1,0
Aaron Wells Peirsol ,M,21,United States,Swimming,2004 Summer,1,0,0
Aaron Wells Peirsol ,M,25,United States,Swimming,2008 Summer,0,1,0
Aaron Wells Peirsol ,M,25,United States,Swimming,2008 Summer,1,0,0
For the output above I would like to be able to sum up all of his medals
(the three numbers at the end of each line represent the gold, silver, and
bronze medals the athlete won at the Olympics over the years).
The assignment does not specify which age (17, 21, 25, 25)
or which Games (2000, 2004, 2008, 2008 Summer) to keep, but I have to add the medals up
in order to be able to sort the athletes by, e.g., most gold medals won.
Any ideas? I can provide my code if you need it, but I think what I need is another MapReduce job that takes the output above as input and gives us something like:
Aaron Wells Peirsol,M,25,United States,Swimming,2008 Summer,2,2,0
It would also be very helpful if there were a way to remove the "\t" from the reduce output!
Thanks for your time, everyone. Gyftonikolos Nikolaos.
Although it may look a bit tricky at first, this is just another variation of the classic WordCount example; this time, composite keys and values are needed in order to feed the data from the mapper to the reducer as key-value pairs.

For the mapper, we need to extract all the information from each line of the input file and split the column data into two "categories": the info that is always the same for an athlete's records, and the info that changes from record to record. The columns that we know will never change for an athlete are the name, sex, country, and sport. By using the ',' character as a delimiter between them, all of these together form the key. The rest of the column data goes on the value side of the key-value pair, but we need delimiters there too, in order to first distinguish the age and Olympics year from the medal counters. We will use:

- the '@' character as the delimiter between age and year,
- the '#' character as the delimiter between the medal counters, and
- the '_' character as the delimiter between those two groups.

In the Reduce function, all we really have to do is add the medal counts up to find their totals, and find the latest age and Games recorded for each athlete.

To avoid a tab character between the key and the value in the job's output, we can simply set NullWritable as the key of every key-value pair produced by the reducer and put all of the data on the value side, using the ',' character as a delimiter.
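To illustrate the scheme, here is a small standalone snippet (hypothetical, not part of the job itself) that builds one composite value with these separators and splits it back apart exactly the way the reducer will:

```java
// Standalone demo of the composite-value encoding described above.
public class SeparatorDemo {
    public static void main(String[] args) {
        // encode: age '@' year, then '_', then gold '#' silver '#' bronze
        String value = 25 + "@" + "2008 Summer" + "_" + 1 + "#" + 0 + "#" + 0;
        System.out.println(value); // 25@2008 Summer_1#0#0

        // decode, in the same order the reducer does
        String[] splitValue = value.split("_");       // {"25@2008 Summer", "1#0#0"}
        String[] ageAndYear = splitValue[0].split("@"); // {"25", "2008 Summer"}
        String[] medals = splitValue[1].split("#");     // {"1", "0", "0"}
        System.out.println(ageAndYear[0] + " | " + ageAndYear[1] + " | "
                + medals[0] + "," + medals[1] + "," + medals[2]);
    }
}
```

None of '_', '@', or '#' appear in the dataset's columns, which is what makes them safe delimiter choices here.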
The code for this job is shown below:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.FileSystem;
import java.io.IOException;
public class Medals
{
/* input: <byte_offset, line_of_dataset>
* output: <(name,sex,country,sport), (age@year_gold#silver#bronze)>
*/
public static class Map extends Mapper<Object, Text, Text, Text>
{
public void map(Object key, Text value, Context context) throws IOException, InterruptedException
{
String record = value.toString();
String[] columns = record.split(",");
// extract athlete's main info
String name = columns[0];
String sex = columns[1];
String country = columns[3];
String sport = columns[4];
// extract athlete's stat info
String age = columns[2];
String year = columns[5];
String gold = columns[6];
String silver = columns[7];
String bronze = columns[8];
// set the main info as key and the stat info as value
context.write(new Text(name + "," + sex + "," + country + "," + sport), new Text(age + "@" + year + "_" + gold + "#" + silver + "#" + bronze));
}
}
/* input: <(name,sex,country,sport), (age@year_gold#silver#bronze)>
* output: <(NULL, (name,sex,age,country,sport,year,golds,silvers,bronzes)>
*/
public static class Reduce extends Reducer<Text, Text, NullWritable, Text>
{
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
{
// extract athlete's main info
String[] athlete_info = key.toString().split(",");
String name = athlete_info[0];
String sex = athlete_info[1];
String country = athlete_info[2];
String sport = athlete_info[3];
int latest_age = 0;
String latest_games = "";
int gold_cnt = 0;
int silver_cnt = 0;
int bronze_cnt = 0;
// for a single athlete, compute their stats...
for(Text value : values)
{
String[] split_value = value.toString().split("_");
String[] age_and_year = split_value[0].split("@");
String[] medals = split_value[1].split("#");
// find the last age and games the athlete has stats in the input file
if(Integer.parseInt(age_and_year[0]) > latest_age)
{
latest_age = Integer.parseInt(age_and_year[0]);
latest_games = age_and_year[1];
}
            // add up this record's medal counters (the input uses 0/1 flags,
            // but summing also handles counts greater than 1 safely)
            gold_cnt += Integer.parseInt(medals[0]);
            silver_cnt += Integer.parseInt(medals[1]);
            bronze_cnt += Integer.parseInt(medals[2]);
}
        context.write(NullWritable.get(),
                new Text(name + "," + sex + "," + latest_age + "," + country + "," + sport + ","
                        + latest_games + "," + gold_cnt + "," + silver_cnt + "," + bronze_cnt));
}
}
public static void main(String[] args) throws Exception
{
// set the paths of the input and output directories in the HDFS
Path input_dir = new Path("olympic_stats");
Path output_dir = new Path("medals");
// in case the output directory already exists, delete it
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
if(fs.exists(output_dir))
fs.delete(output_dir, true);
// configure the MapReduce job
Job medals_job = Job.getInstance(conf, "Medals Counter");
medals_job.setJarByClass(Medals.class);
medals_job.setMapperClass(Map.class);
medals_job.setReducerClass(Reduce.class);
medals_job.setMapOutputKeyClass(Text.class);
medals_job.setMapOutputValueClass(Text.class);
medals_job.setOutputKeyClass(NullWritable.class);
medals_job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(medals_job, input_dir);
FileOutputFormat.setOutputPath(medals_job, output_dir);
medals_job.waitForCompletion(true);
}
}
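As a side note, an alternative to the NullWritable trick: TextOutputFormat's key-value separator (a tab by default) can be changed through the job configuration, so the reducer could keep a real Text key and still produce comma-separated lines. A minimal sketch of the relevant configuration, assuming the same job setup as above:

```java
// Alternative to the NullWritable key: change the separator that
// TextOutputFormat puts between key and value (defaults to "\t").
Configuration conf = new Configuration();
conf.set("mapreduce.output.textoutputformat.separator", ",");
Job medals_job = Job.getInstance(conf, "Medals Counter");
```

With this set, a reducer writing `<Text, Text>` pairs would produce `key,value` lines directly.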