I'm writing a MapReduce job to mine web server logs. The job reads its input from text files and writes its output to a MySQL database. The problem is that the job completes successfully, yet nothing is written to the DB. I haven't done MR programming in a while, so it's most likely a bug that I'm not able to find. It's not the pattern matching (see below); I've unit tested that and it works fine. What am I missing? Environment: Mac OS X, Oracle JDK 1.8.0_31, hadoop-2.6.0.
Note: The exceptions are logged; I've omitted the handlers for brevity.
SkippableLogRecord:
public class SkippableLogRecord implements
        WritableComparable<SkippableLogRecord> {
    // fields

    public SkippableLogRecord(Text line) {
        readLine(line.toString());
    }

    private void readLine(String line) {
        Matcher m = PATTERN.matcher(line);

        boolean isMatchFound = m.matches() && m.groupCount() >= 5;

        if (isMatchFound) {
            try {
                jvm = new Text(m.group("jvm"));

                Calendar cal = getInstance();
                cal.setTime(new SimpleDateFormat(DATE_FORMAT).parse(m
                        .group("date")));
                day = new IntWritable(cal.get(DAY_OF_MONTH));
                month = new IntWritable(cal.get(MONTH));
                year = new IntWritable(cal.get(YEAR));

                String p = decode(m.group("path"), UTF_8.name());
                root = new Text(p.substring(1, p.indexOf(FILE_SEPARATOR, 1)));
                filename = new Text(
                        p.substring(p.lastIndexOf(FILE_SEPARATOR) + 1));
                path = new Text(p);

                status = new IntWritable(Integer.parseInt(m.group("status")));
                size = new LongWritable(Long.parseLong(m.group("size")));
            } catch (ParseException | UnsupportedEncodingException e) {
                isMatchFound = false;
            }
        }
    }

    public boolean isSkipped() {
        return jvm == null;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        jvm.readFields(in);
        day.readFields(in);
        // more code
    }

    @Override
    public void write(DataOutput out) throws IOException {
        jvm.write(out);
        day.write(out);
        // more code
    }

    @Override
    public int compareTo(SkippableLogRecord other) {...}

    @Override
    public boolean equals(Object obj) {...}
}
Mapper:
public class LogMapper extends
        Mapper<LongWritable, Text, SkippableLogRecord, NullWritable> {
    @Override
    protected void map(LongWritable key, Text line, Context context) {
        SkippableLogRecord rec = new SkippableLogRecord(line);

        if (!rec.isSkipped()) {
            try {
                context.write(rec, NullWritable.get());
            } catch (IOException | InterruptedException e) {...}
        }
    }
}
Reducer:
public class LogReducer extends
        Reducer<SkippableLogRecord, NullWritable, DBRecord, NullWritable> {
    @Override
    protected void reduce(SkippableLogRecord rec,
            Iterable<NullWritable> values, Context context) {
        try {
            context.write(new DBRecord(rec), NullWritable.get());
        } catch (IOException | InterruptedException e) {...}
    }
}
DBRecord:
public class DBRecord implements Writable, DBWritable {
    // fields

    public DBRecord(SkippableLogRecord logRecord) {
        jvm = logRecord.getJvm().toString();
        day = logRecord.getDay().get();
        // more code for rest of the fields
    }

    @Override
    public void readFields(ResultSet rs) throws SQLException {
        jvm = rs.getString("jvm");
        day = rs.getInt("day");
        // more code for rest of the fields
    }

    @Override
    public void write(PreparedStatement ps) throws SQLException {
        ps.setString(1, jvm);
        ps.setInt(2, day);
        // more code for rest of the fields
    }
}
Driver:
public class Driver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();

        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver", // driver
                "jdbc:mysql://localhost:3306/aac", // db url
                "***", // user name
                "***"); // password

        Job job = Job.getInstance(conf, "log-miner");
        job.setJarByClass(getClass());

        job.setMapperClass(LogMapper.class);
        job.setReducerClass(LogReducer.class);

        job.setMapOutputKeyClass(SkippableLogRecord.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(DBRecord.class);
        job.setOutputValueClass(NullWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(DBOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        DBOutputFormat.setOutput(job, "log", // table name
                new String[] { "jvm", "day", "month", "year", "root",
                        "filename", "path", "status", "size" } // table columns
        );

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        GenericOptionsParser parser = new GenericOptionsParser(
                new Configuration(), args);

        ToolRunner.run(new Driver(), parser.getRemainingArgs());
    }
}
Job execution log:
15/02/28 02:17:58 INFO mapreduce.Job: map 100% reduce 100%
15/02/28 02:17:58 INFO mapreduce.Job: Job job_local166084441_0001 completed successfully
15/02/28 02:17:58 INFO mapreduce.Job: Counters: 35
    File System Counters
        FILE: Number of bytes read=37074
        FILE: Number of bytes written=805438
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=476788498
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=11
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Map-Reduce Framework
        Map input records=482230
        Map output records=0
        Map output bytes=0
        Map output materialized bytes=12
        Input split bytes=210
        Combine input records=0
        Combine output records=0
        Reduce input groups=0
        Reduce shuffle bytes=12
        Reduce input records=0
        Reduce output records=0
        Spilled Records=0
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=150
        Total committed heap usage (bytes)=1381498880
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=171283337
    File Output Format Counters
        Bytes Written=0
To answer my own question: the issue was leading whitespace, which caused the matcher to fail. The unit tests didn't cover lines with leading whitespace, but the actual logs had it for some reason. That is why Map output records=0 above: every line was treated as skipped. Another issue with the code posted above was that all the fields in the class were initialized in the readLine method. As @Anony-Mousse mentioned, this is expensive because Hadoop data types are designed to be reused. It also caused a bigger problem with serialization and deserialization: when Hadoop tried to reconstruct the record by calling readFields, it threw an NPE because all the fields were null. I also made other minor improvements using some Java 8 classes and syntax. In the end, even though I got it working, I rewrote the code using Spring Boot, Spring Data JPA, and Spring's support for asynchronous processing via @Async.
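To illustrate the whitespace failure: Matcher.matches() anchors at both ends of the input, so a single leading space defeats a pattern that doesn't allow for it. A stripped-down, self-contained sketch (the pattern below is a simplified stand-in for the real log pattern, not the actual one):

```java
import java.util.regex.Pattern;

public class LeadingWhitespace {
    // Simplified stand-in for the real log pattern.
    private static final Pattern PATTERN =
            Pattern.compile("(?<jvm>\\S+) (?<status>\\d{3}) (?<size>\\d+)");

    static boolean matches(String line) {
        // matches() requires the WHOLE input to match, including position 0.
        return PATTERN.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("jvm1 200 1024"));          // true
        System.out.println(matches(" jvm1 200 1024"));         // false: leading space
        // Trimming first (or prefixing the pattern with \s*) fixes it.
        System.out.println(matches(" jvm1 200 1024".trim()));  // true
    }
}
```

Calling `line.trim()` (or `strip()` on Java 11+) before matching is the cheapest fix; tolerating `\s*` in the pattern itself also works.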
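The readFields NPE comes from the Writable contract: the framework rebuilds keys via the no-arg constructor and then calls readFields, so delegating calls like jvm.readFields(in) blow up when the fields are still null. The safe pattern is to pre-allocate every field and let readFields deserialize into them. A minimal pure-JDK sketch of that round trip (MiniRecord is illustrative, not the original class, so it can run without Hadoop on the classpath):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative stand-in for a Hadoop Writable: every field is
// allocated up front, so readFields never dereferences null.
class MiniRecord {
    // Pre-initialized, mutable holder -- never left null.
    private final StringBuilder jvm = new StringBuilder();
    private int day;

    // No-arg constructor: this is how the framework rebuilds records.
    MiniRecord() {}

    void set(String jvmValue, int dayValue) {
        jvm.setLength(0);
        jvm.append(jvmValue);
        day = dayValue;
    }

    void write(DataOutput out) throws IOException {
        out.writeUTF(jvm.toString());
        out.writeInt(day);
    }

    void readFields(DataInput in) throws IOException {
        jvm.setLength(0);
        jvm.append(in.readUTF()); // fills the pre-allocated holder
        day = in.readInt();
    }

    String jvm() { return jvm.toString(); }
    int day() { return day; }
}

public class RoundTrip {
    public static void main(String[] args) throws IOException {
        MiniRecord original = new MiniRecord();
        original.set("jvm1", 28);

        // Serialize.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Deserialize into a freshly no-arg-constructed instance,
        // as Hadoop does during the shuffle.
        MiniRecord copy = new MiniRecord();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(copy.jvm() + " " + copy.day()); // jvm1 28
    }
}
```

The same shape applies to the real class: declare `jvm = new Text()`, `day = new IntWritable()`, etc. at construction time and have readLine call `jvm.set(...)` instead of allocating new objects per line.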