MapReduce Hadoop on Linux - Multiple data on input

Question

I am using Ubuntu 20.10 on VirtualBox with Hadoop version 3.2.1 (if you need any more information, leave me a comment).
My output at the moment gives me this:

Aaron Wells Peirsol ,M,17,United States,Swimming,2000 Summer,0,1,0
Aaron Wells Peirsol ,M,21,United States,Swimming,2004 Summer,1,0,0
Aaron Wells Peirsol ,M,25,United States,Swimming,2008 Summer,0,1,0
Aaron Wells Peirsol ,M,25,United States,Swimming,2008 Summer,1,0,0

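For the output above, I want to be able to sum up all of his medals
(the three numbers at the end of each line are the gold, silver, and bronze
medals the participant has won at the Olympic Games over the years).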

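The project doesn't specify which age to keep (17, 21, 25, 25)
or which games (2000, 2004, 2008, 2008 Summer), but I have to add up the medals
so that I can sort the participants by who won the most gold medals, and so on.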

Any ideas? I can provide my code if you need it, but I think I need another MapReduce job that takes the output shown above as its input and gives us something like:

Aaron Wells Peirsol,M,25,United States,Swimming,2008 Summer,2,2,0

It would also be very helpful if there were a way to remove the "\t" from the reduce output!

Thank you all for your time, Gyftonikolos Nikolaos.

Answer 1

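While it might seem a bit tricky at first, this is another case of the WordCount example, only this time composite keys and values are needed to pass the data from the mapper to the reducer as key-value pairs.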

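For the mapper, we need to extract all the information from each line of the input file and split the column data into two "categories":

the key information, which is always the same for each athlete
the stat information, which changes from line to line and needs to be processed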

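Across an athlete's lines, we know the columns that never change are the athlete's name, sex, country, and sport. All of these together will be treated as the key, with the "," character as the delimiter between the fields. The rest of the column data goes on the value side of the key-value pair, but we need delimiters there too, so that the medal counters can be told apart from the age and Olympic year. We will use: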

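the "@" character as the delimiter between the age and the year,
the "#" character as the delimiter between the medal counters,
and the "_" character as the delimiter between those two groups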
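
The delimiter scheme above can be tried out without a cluster. The following is a minimal plain-Java sketch of the split the mapper performs; the class name MedalRecordCodec and the method toKeyValue are hypothetical names used only for illustration, and the column order is assumed to match the sample lines in the question.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper (not part of the job): mirrors the mapper's key/value split.
final class MedalRecordCodec {
    // Assumed column order: name,sex,age,country,sport,games,gold,silver,bronze
    static Map.Entry<String, String> toKeyValue(String line) {
        String[] c = line.split(",");
        // key: the columns that never change for an athlete
        String key = String.join(",", c[0], c[1], c[3], c[4]);
        // value: age@games_gold#silver#bronze
        String value = c[2] + "@" + c[5] + "_" + c[6] + "#" + c[7] + "#" + c[8];
        return new SimpleEntry<>(key, value);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> kv = toKeyValue(
            "Aaron Wells Peirsol ,M,17,United States,Swimming,2000 Summer,0,1,0");
        System.out.println(kv.getKey());   // Aaron Wells Peirsol ,M,United States,Swimming
        System.out.println(kv.getValue()); // 17@2000 Summer_0#1#0
    }
}
```

Note that the trailing space after the name in the sample data ends up in the key unchanged; it is harmless as long as it is consistent across an athlete's lines.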

In the Reduce function, all we have to do is add up the medals to get their totals and find the latest age and year for each athlete.
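
To see what that aggregation amounts to, here is a small plain-Java sketch that applies it to the values of a single key, with no Hadoop types involved. MedalReduceSketch and its reduce method are hypothetical names for illustration. The sketch sums the counters, which gives the same totals as incrementing by one whenever every counter is 0 or 1, as in the sample data.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the reducer logic, operating on plain strings.
final class MedalReduceSketch {
    // key:    name,sex,country,sport
    // values: age@games_gold#silver#bronze (one per input line of the athlete)
    static String reduce(String key, List<String> values) {
        String[] info = key.split(",");
        int latestAge = 0;
        String latestGames = "";
        int gold = 0, silver = 0, bronze = 0;

        for (String value : values) {
            String[] parts = value.split("_");
            String[] ageAndGames = parts[0].split("@");
            String[] medals = parts[1].split("#");

            // keep the games that go with the highest age seen so far
            int age = Integer.parseInt(ageAndGames[0]);
            if (age > latestAge) {
                latestAge = age;
                latestGames = ageAndGames[1];
            }

            gold += Integer.parseInt(medals[0]);
            silver += Integer.parseInt(medals[1]);
            bronze += Integer.parseInt(medals[2]);
        }

        return String.join(",", info[0], info[1], String.valueOf(latestAge),
            info[2], info[3], latestGames,
            String.valueOf(gold), String.valueOf(silver), String.valueOf(bronze));
    }

    public static void main(String[] args) {
        System.out.println(reduce(
            "Aaron Wells Peirsol ,M,United States,Swimming",
            Arrays.asList("17@2000 Summer_0#1#0", "21@2004 Summer_1#0#0",
                          "25@2008 Summer_0#1#0", "25@2008 Summer_1#0#0")));
    }
}
```

For the four sample lines in the question this yields 2 gold, 2 silver, and 0 bronze, with age 25 and "2008 Summer" kept as the latest entry.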

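To avoid having a tab character between the key and the value in the MapReduce job's output, we can simply emit a NullWritable as the key of every pair the reducer produces and put all the data in the value, using the "," character as the delimiter.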
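
Why this removes the tab: the default text output writes the separator only when both the key and the value are present. The sketch below is a simplified plain-Java model of that rule, not Hadoop's actual code; TextLineModel and formatLine are hypothetical names, and null stands in for NullWritable.

```java
// Simplified model of how a text output line is assembled (illustration only).
final class TextLineModel {
    // null plays the role of NullWritable in this sketch
    static String formatLine(String key, String value, String separator) {
        StringBuilder line = new StringBuilder();
        if (key != null) {
            line.append(key);
        }
        // the separator appears only between a real key and a real value
        if (key != null && value != null) {
            line.append(separator);
        }
        if (value != null) {
            line.append(value);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(formatLine("key", "value", "\t")); // key<TAB>value
        System.out.println(formatLine(null, "a,b,c", "\t"));  // a,b,c
    }
}
```

If you would rather keep a real key, another option is to set the mapreduce.output.textoutputformat.separator property to "," in the job's Configuration; that is a different approach from the one used in this answer.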

The code for this job looks like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;


public class Medals 
{
    /* input:  <byte_offset, line_of_dataset>
     * output: <(name,sex,country,sport), (age@year_gold#silver#bronze)>
     */
    public static class Map extends Mapper<Object, Text, Text, Text> 
    {
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException 
        {
            String record = value.toString();
            String[] columns = record.split(",");

            // extract athlete's main info
            String name = columns[0];
            String sex = columns[1];
            String country = columns[3];
            String sport = columns[4];

            // extract athlete's stat info
            String age = columns[2];
            String year = columns[5]; 
            String gold = columns[6];
            String silver = columns[7];
            String bronze = columns[8];

            // set the main info as key and the stat info as value
            context.write(new Text(name + "," + sex + "," + country + "," + sport), new Text(age + "@" + year + "_" +  gold + "#" + silver + "#" + bronze));
        }
    }

    /* input:  <(name,sex,country,sport), (age@year_gold#silver#bronze)>
     * output: <NULL, (name,sex,age,country,sport,year,golds,silvers,bronzes)>
     */
    public static class Reduce extends Reducer<Text, Text, NullWritable, Text>
    {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException 
        {
            // extract athlete's main info
            String[] athlete_info = key.toString().split(",");
            String name = athlete_info[0];
            String sex = athlete_info[1];
            String country = athlete_info[2];
            String sport = athlete_info[3];

            int latest_age = 0;
            String latest_games = "";
            
            int gold_cnt = 0;
            int silver_cnt = 0;
            int bronze_cnt = 0;

            // for a single athlete, compute their stats...
            for(Text value : values)
            {
                String[] split_value = value.toString().split("_");
                String[] age_and_year = split_value[0].split("@");
                String[] medals = split_value[1].split("#");

                // find the last age and games the athlete has stats in the input file
                if(Integer.parseInt(age_and_year[0]) > latest_age)
                {
                    latest_age = Integer.parseInt(age_and_year[0]);
                    latest_games = age_and_year[1];
                }
                
                // add this line's medal counts to the athlete's totals
                gold_cnt += Integer.parseInt(medals[0]);
                silver_cnt += Integer.parseInt(medals[1]);
                bronze_cnt += Integer.parseInt(medals[2]);
            }

            context.write(NullWritable.get(), new Text(name + "," + sex + "," + String.valueOf(latest_age) + "," + country + "," + sport + "," + latest_games + "," + String.valueOf(gold_cnt) + "," + String.valueOf(silver_cnt) + "," + String.valueOf(bronze_cnt)));
        }
    }


    public static void main(String[] args) throws Exception
    {
        // set the paths of the input and output directories in the HDFS
        Path input_dir = new Path("olympic_stats");
        Path output_dir = new Path("medals");

        // in case the output directory already exists, delete it
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        if(fs.exists(output_dir))
            fs.delete(output_dir, true);

        // configure the MapReduce job
        Job medals_job = Job.getInstance(conf, "Medals Counter");
        medals_job.setJarByClass(Medals.class);
        medals_job.setMapperClass(Map.class);
        medals_job.setReducerClass(Reduce.class);    
        medals_job.setMapOutputKeyClass(Text.class);
        medals_job.setMapOutputValueClass(Text.class);
        medals_job.setOutputKeyClass(NullWritable.class);
        medals_job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(medals_job, input_dir);
        FileOutputFormat.setOutputPath(medals_job, output_dir);
        medals_job.waitForCompletion(true);
    }
}

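The result is, of course, in the format you wanted: one line per athlete, with the medal counts summed and no tab between the key and the value.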
