I'm writing a MapReduce job to mine web server logs. The job reads its input from text files and writes its output to a MySQL database. The problem is that the job completes successfully, yet nothing is written to the DB. I haven't done MR programming in a while, so it's most likely a bug that I'm not able to find. It's not the pattern matching (see below); I've unit tested that and it works fine. What am I missing? Environment: Mac OS X, Oracle JDK 1.8.0_31, hadoop-2.6.0.
Note: The exceptions are logged; I've omitted the handlers for brevity.
SkippableLogRecord:
public class SkippableLogRecord implements
        WritableComparable<SkippableLogRecord> {
    // fields

    public SkippableLogRecord(Text line) {
        readLine(line.toString());
    }

    private void readLine(String line) {
        Matcher m = PATTERN.matcher(line);

        boolean isMatchFound = m.matches() && m.groupCount() >= 5;

        if (isMatchFound) {
            try {
                jvm = new Text(m.group("jvm"));

                Calendar cal = getInstance();
                cal.setTime(new SimpleDateFormat(DATE_FORMAT).parse(m
                        .group("date")));
                day = new IntWritable(cal.get(DAY_OF_MONTH));
                month = new IntWritable(cal.get(MONTH));
                year = new IntWritable(cal.get(YEAR));

                String p = decode(m.group("path"), UTF_8.name());
                root = new Text(p.substring(1, p.indexOf(FILE_SEPARATOR, 1)));
                filename = new Text(
                        p.substring(p.lastIndexOf(FILE_SEPARATOR) + 1));
                path = new Text(p);

                status = new IntWritable(Integer.parseInt(m.group("status")));
                size = new LongWritable(Long.parseLong(m.group("size")));
            } catch (ParseException | UnsupportedEncodingException e) {
                isMatchFound = false;
            }
        }
    }

    public boolean isSkipped() {
        return jvm == null;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        jvm.readFields(in);
        day.readFields(in);
        // more code
    }

    @Override
    public void write(DataOutput out) throws IOException {
        jvm.write(out);
        day.write(out);
        // more code
    }

    @Override
    public int compareTo(SkippableLogRecord other) {...}

    @Override
    public boolean equals(Object obj) {...}
}
Mapper:
public class LogMapper extends
        Mapper<LongWritable, Text, SkippableLogRecord, NullWritable> {
    @Override
    protected void map(LongWritable key, Text line, Context context) {
        SkippableLogRecord rec = new SkippableLogRecord(line);

        if (!rec.isSkipped()) {
            try {
                context.write(rec, NullWritable.get());
            } catch (IOException | InterruptedException e) {...}
        }
    }
}
Reducer:
public class LogReducer extends
        Reducer<SkippableLogRecord, NullWritable, DBRecord, NullWritable> {
    @Override
    protected void reduce(SkippableLogRecord rec,
            Iterable<NullWritable> values, Context context) {
        try {
            context.write(new DBRecord(rec), NullWritable.get());
        } catch (IOException | InterruptedException e) {...}
    }
}
DBRecord:
public class DBRecord implements Writable, DBWritable {
    // fields

    public DBRecord(SkippableLogRecord logRecord) {
        jvm = logRecord.getJvm().toString();
        day = logRecord.getDay().get();
        // more code for rest of the fields
    }

    @Override
    public void readFields(ResultSet rs) throws SQLException {
        jvm = rs.getString("jvm");
        day = rs.getInt("day");
        // more code for rest of the fields
    }

    @Override
    public void write(PreparedStatement ps) throws SQLException {
        ps.setString(1, jvm);
        ps.setInt(2, day);
        // more code for rest of the fields
    }
}
Driver:
public class Driver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();

        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver", // driver
                "jdbc:mysql://localhost:3306/aac", // db url
                "***", // user name
                "***"); // password

        Job job = Job.getInstance(conf, "log-miner");
        job.setJarByClass(getClass());

        job.setMapperClass(LogMapper.class);
        job.setReducerClass(LogReducer.class);

        job.setMapOutputKeyClass(SkippableLogRecord.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(DBRecord.class);
        job.setOutputValueClass(NullWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(DBOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        DBOutputFormat.setOutput(job, "log", // table name
                new String[] { "jvm", "day", "month", "year", "root",
                        "filename", "path", "status", "size" } // table columns
        );

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        GenericOptionsParser parser = new GenericOptionsParser(
                new Configuration(), args);

        ToolRunner.run(new Driver(), parser.getRemainingArgs());
    }
}
Job execution log:
15/02/28 02:17:58 INFO mapreduce.Job: map 100% reduce 100%
15/02/28 02:17:58 INFO mapreduce.Job: Job job_local166084441_0001 completed successfully
15/02/28 02:17:58 INFO mapreduce.Job: Counters: 35
    File System Counters
        FILE: Number of bytes read=37074
        FILE: Number of bytes written=805438
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=476788498
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=11
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Map-Reduce Framework
        Map input records=482230
        Map output records=0
        Map output bytes=0
        Map output materialized bytes=12
        Input split bytes=210
        Combine input records=0
        Combine output records=0
        Reduce input groups=0
        Reduce shuffle bytes=12
        Reduce input records=0
        Reduce output records=0
        Spilled Records=0
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=150
        Total committed heap usage (bytes)=1381498880
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=171283337
    File Output Format Counters
        Bytes Written=0
To answer my own question: the issue was leading whitespace, which caused the matcher to fail. The unit tests didn't cover lines with leading whitespace, but the actual logs had it for some reason. That is why Map output records=0 above: every line was treated as skipped. Another issue with the code posted above was that all the fields in the class were initialized in the readLine method. As @Anony-Mousse mentioned, this is expensive because Hadoop data types are designed to be reused. It also caused a bigger problem with serialization and deserialization: when Hadoop tried to reconstruct the record by calling readFields, it threw an NPE because all the fields were null. I also made other minor improvements using some Java 8 classes and syntax. In the end, even though I got it working, I rewrote the code using Spring Boot, Spring Data JPA, and Spring's support for asynchronous processing via @Async.
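To illustrate the whitespace failure: Matcher.matches() anchors at both ends of the input, so a single leading space defeats a pattern that doesn't allow for it. A stripped-down, self-contained sketch (the pattern below is a simplified stand-in for the real log pattern, not the actual one):

```java
import java.util.regex.Pattern;

public class LeadingWhitespace {
    // Simplified stand-in for the real log pattern.
    private static final Pattern PATTERN =
            Pattern.compile("(?<jvm>\\S+) (?<status>\\d{3}) (?<size>\\d+)");

    static boolean matches(String line) {
        // matches() requires the WHOLE input to match, including position 0.
        return PATTERN.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("jvm1 200 1024"));          // true
        System.out.println(matches(" jvm1 200 1024"));         // false: leading space
        // Trimming first (or prefixing the pattern with \s*) fixes it.
        System.out.println(matches(" jvm1 200 1024".trim()));  // true
    }
}
```

Calling `line.trim()` (or `strip()` on Java 11+) before matching is the cheapest fix; tolerating `\s*` in the pattern itself also works.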
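The readFields NPE comes from the Writable contract: the framework rebuilds keys via the no-arg constructor and then calls readFields, so delegating calls like jvm.readFields(in) blow up when the fields are still null. The safe pattern is to pre-allocate every field and let readFields deserialize into them. A minimal pure-JDK sketch of that round trip (MiniRecord is illustrative, not the original class, so it can run without Hadoop on the classpath):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative stand-in for a Hadoop Writable: every field is
// allocated up front, so readFields never dereferences null.
class MiniRecord {
    // Pre-initialized, mutable holder -- never left null.
    private final StringBuilder jvm = new StringBuilder();
    private int day;

    // No-arg constructor: this is how the framework rebuilds records.
    MiniRecord() {}

    void set(String jvmValue, int dayValue) {
        jvm.setLength(0);
        jvm.append(jvmValue);
        day = dayValue;
    }

    void write(DataOutput out) throws IOException {
        out.writeUTF(jvm.toString());
        out.writeInt(day);
    }

    void readFields(DataInput in) throws IOException {
        jvm.setLength(0);
        jvm.append(in.readUTF()); // fills the pre-allocated holder
        day = in.readInt();
    }

    String jvm() { return jvm.toString(); }
    int day() { return day; }
}

public class RoundTrip {
    public static void main(String[] args) throws IOException {
        MiniRecord original = new MiniRecord();
        original.set("jvm1", 28);

        // Serialize.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Deserialize into a freshly no-arg-constructed instance,
        // as Hadoop does during the shuffle.
        MiniRecord copy = new MiniRecord();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(copy.jvm() + " " + copy.day()); // jvm1 28
    }
}
```

The same shape applies to the real class: declare `jvm = new Text()`, `day = new IntWritable()`, etc. at construction time and have readLine call `jvm.set(...)` instead of allocating new objects per line.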