Skipping the header row in Java MapReduce code

Question

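I am trying to summarize a CSV file whose first line is a header. Is there a way, in Java code, to emit each column value together with its header name as a key-value pair?

For example, the input file looks like this: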

A B C D

1,2,3,4

5,6,7,8

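I want the mapper output to be (A,1),(B,2),(C,3),(D,4),(A,5),....

Note: I tried overriding the run function in the Mapper class to skip the first line. But as far as I know, run is called once for every input split, so it does not suit my needs. Any help with this would be greatly appreciated.

This is what my mapper looks like: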

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String line = value.toString();
        String[] splits = line.split(",",-1);
        int length = splits.length;
    //  count = 0;

        for (int i = 0; i < length; i++) {
            columnName.set(header[i]);      
            context.write(columnName, new Text(splits[i]+""));
        }

    }

    public void run(Context context) throws IOException, InterruptedException
    {        
        setup(context); 
        try 
        {

            if (context.nextKeyValue())
            { 

                Text columnHeader = context.getCurrentValue();
                header =  columnHeader.toString().split(",");

            }    
            while (context.nextKeyValue()) 
            {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } 
        finally 
        {
            cleanup(context);
        }      
    }

Answer 1

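I assume that the column headers are letters and the column values are numbers.

One way to achieve this is to use the DistributedCache. The steps are:

Create a file containing the column headers.
In the driver code, add this file to the distributed cache by calling Job::addCacheFile().
In the mapper's setup() method, access this file from the distributed cache. Parse its contents and store them in a columnHeader list.
In the map() method, check whether the values in each record match the headers (stored in the columnHeader list). If they do, ignore the record (because it contains only the headers). If they do not, emit the values along with the column headers.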

The mapper and driver code look like this:

Driver:

public static void main(String[] args) throws Exception {

    Configuration conf = new Configuration();

    Job job = Job.getInstance(conf, "HeaderParser");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(HeaderParserMapper.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);

    job.addCacheFile(new URI("/in/header.txt#header.txt"));
    FileInputFormat.addInputPath(job, new Path("/in/in7.txt"));
    FileOutputFormat.setOutputPath(job, new Path("/out/"));

    System.exit(job.waitForCompletion(true) ? 0:1);
}

Driver logic:

Copy "header.txt" (which contains just one line: A,B,C,D) to HDFS.
In the driver, add "header.txt" to the distributed cache by executing the following statement:
```
 job.addCacheFile(new URI("/in/header.txt#header.txt"));
```
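
The "#header.txt" suffix in that URI is easy to overlook. As I understand it, Hadoop treats the part before '#' as the HDFS path of the file and the fragment after '#' as the link name created in each task's working directory, which is why the mapper can later open plain "header.txt". A minimal sketch using only java.net.URI shows how the string splits (the class CacheUriParts and its method names are hypothetical, purely for illustration):

```java
import java.net.URI;

final class CacheUriParts {
    // Part before '#': where the file lives (here, a path on HDFS).
    static String hdfsPath(String cacheUri) {
        return URI.create(cacheUri).getPath();
    }

    // Part after '#': the local name the task is assumed to see the file under.
    static String linkName(String cacheUri) {
        return URI.create(cacheUri).getFragment();
    }

    public static void main(String[] args) {
        String cacheUri = "/in/header.txt#header.txt";
        System.out.println(hdfsPath(cacheUri));
        System.out.println(linkName(cacheUri));
    }
}
```

If the fragment were changed, the file name passed to FileReader in the mapper would presumably have to change with it.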

Mapper:

public static class HeaderParserMapper
        extends Mapper<LongWritable, Text , Text, NullWritable>{

    String[] headerList;
    String header;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // "header.txt" is the link name given after '#' in addCacheFile()
        try (BufferedReader bufferedReader = new BufferedReader(new FileReader("header.txt"))) {
            header = bufferedReader.readLine();
            headerList = header.split(",");
        }
    }

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String line = value.toString();
        String[] values = line.split(",");

        if(headerList.length == values.length && !header.equals(line)) {
            for(int i = 0; i < values.length; i++)
                context.write(new Text(headerList[i] + "," + values[i]), NullWritable.get());
        }
    }
}

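Mapper logic:

Override the setup() method.
In setup(), read "header.txt" (which the driver placed in the distributed cache).
In map(), check whether the line matches the header. If it does, ignore the line. Otherwise, emit the headers and values as (h1,v1), (h2,v2), (h3,v3) and (h4,v4).

I ran this program on the following input: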

A,B,C,D
1,2,3,4
5,6,7,8

I got the following output (where each value is matched with its corresponding header):

A,1
A,5
B,2
B,6
C,3
C,7
D,4
D,8
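
The per-line logic of map() can also be checked without a cluster. Below is a minimal plain-Java sketch of it; the class HeaderPairing and the method pair are hypothetical names, not part of the answer's code. One deliberate difference: it splits with a limit of -1, as the question's mapper does, so trailing empty fields are kept.

```java
import java.util.ArrayList;
import java.util.List;

final class HeaderPairing {
    /** Pairs each value on a data line with its column name; returns an empty list for skipped lines. */
    static List<String> pair(String header, String line) {
        List<String> out = new ArrayList<>();
        String[] names = header.split(",", -1);
        String[] values = line.split(",", -1);
        // Same guard as the mapper: skip the header line itself
        // and any line whose column count differs from the header's.
        if (names.length != values.length || header.equals(line)) {
            return out;
        }
        for (int i = 0; i < values.length; i++) {
            out.add(names[i] + "," + values[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        String header = "A,B,C,D";
        for (String line : new String[] {"A,B,C,D", "1,2,3,4", "5,6,7,8"}) {
            System.out.println(pair(header, line));
        }
    }
}
```

The sketch emits pairs in input order. In the job output above, the lines appear grouped by header instead because the framework sorts map output by key.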

Answer 2

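The accepted answer by @Manjunath Ballur is a good hack. However, MapReduce code should be kept simple, and checking every line against the header is not the recommended approach.

One way to go is to write a custom InputFormat that does this work for you.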
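
The core of such an InputFormat would be a record reader that drops the first line only for the split that begins at byte offset 0, since only that split contains the header. What follows is a plain-Java model of that rule, not Hadoop code: the class SplitAwareHeaderSkip and its methods are hypothetical, the file is modeled as an ASCII string so character offsets equal byte offsets, and the boundary handling reflects my understanding of how line-based readers treat splits (a split that does not start at 0 discards the line already in progress, and the previous split reads through it).

```java
import java.util.ArrayList;
import java.util.List;

final class SplitAwareHeaderSkip {
    /** Returns the data lines a reader for the split [start, start + length) would emit. */
    static List<String> readSplit(String file, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Not the first split: the line in progress belongs to the previous split.
            pos = nextLineStart(file, pos);
        }
        boolean skipHeader = (start == 0); // only the split at offset 0 holds the header
        List<String> lines = new ArrayList<>();
        while (pos <= end && pos < file.length()) {
            int next = nextLineStart(file, pos);
            String line = file.substring(pos, next);
            if (line.endsWith("\n")) {
                line = line.substring(0, line.length() - 1);
            }
            if (skipHeader) {
                skipHeader = false; // consume the header exactly once
            } else {
                lines.add(line);
            }
            pos = next;
        }
        return lines;
    }

    // Index of the first character after the next newline, or end of file.
    private static int nextLineStart(String file, int from) {
        int newline = file.indexOf('\n', from);
        return newline < 0 ? file.length() : newline + 1;
    }

    public static void main(String[] args) {
        String file = "A,B,C,D\n1,2,3,4\n5,6,7,8\n";
        System.out.println(readSplit(file, 0, 12));
        System.out.println(readSplit(file, 12, 12));
    }
}
```

Skipping the header this way removes the per-line comparison, but mappers working on later splits still never see the column names, so those would have to reach them some other way, for example through the cached header file from the first answer.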
