I parsed a .log file into a JavaRDD and then sorted it. Now I have, for example, this oldJavaRDD:
2016-03-28 | 11:00 | X | object1 | region1
2016-03-28 | 11:01 | Y | object1 | region1
2016-03-28 | 11:05 | X | object1 | region1
2016-03-28 | 11:09 | X | object1 | region1
2016-03-28 | 11:00 | X | object2 | region1
2016-03-28 | 11:01 | Z | object2 | region1
How can I get a newJavaRDD to save to the DB?
The new JavaRDD structure has to be:
2016-03-28 | 9 | object1 | region1
2016-03-28 | 1 | object2 | region1
So I have to calculate the time between the current and the previous row (also using the flag X, Y or Z in some cases to decide whether to add the time to the result), and add a new element to the JavaRDD whenever the date, objectName or objectRegion changes.
I can do it with this kind of code (map), but I don't think it's good or the fastest way:
JavaRDD<NewObject> newJavaRDD = oldJavaRDD.map { r ->
String datePrev[] = ...
if (datePrev != dateCurr ...) {
return newJavaRdd;
} else {
return null;
}
}
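First, your code example references newJavaRDD from within the transformation that creates newJavaRDD. That's impossible on a few different levels: a variable can't be referenced inside its own initializer, and an RDD can't be used inside a transformation on an RDD.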
So, how should you do that?
Assuming you want a single record for each combination of date + object + region, and that it is safe to groupBy these fields as key, you can groupBy the key fields and then mapValues to get the "minute distance" between the first and last record (the function passed to mapValues can be changed to contain your exact logic if I didn't get it right). I'll use the Joda-Time library for the time calculations:
import java.util.Iterator;
import com.google.common.collect.Lists;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.joda.time.LocalDateTime;
import org.joda.time.Period;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

public class GroupByExample { // class name is arbitrary

    public static void main(String[] args) {
        // some setup code for this test:
        JavaSparkContext sc = new JavaSparkContext("local", "test");

        // input:
        final JavaRDD<String[]> input = sc.parallelize(Lists.newArrayList(
                //            date          time     flag  object     region
                new String[]{"2016-03-28", "11:00", "X", "object1", "region1"},
                new String[]{"2016-03-28", "11:01", "Y", "object1", "region1"},
                new String[]{"2016-03-28", "11:05", "X", "object1", "region1"},
                new String[]{"2016-03-28", "11:09", "X", "object1", "region1"},
                new String[]{"2016-03-28", "11:00", "X", "object2", "region1"},
                new String[]{"2016-03-28", "11:01", "Z", "object2", "region1"}
        ));

        // grouping by key:
        final JavaPairRDD<String, Iterable<String[]>> byObjectAndDate = input.groupBy(new Function<String[], String>() {
            @Override
            public String call(String[] record) throws Exception {
                // date, object, region; the separator keeps distinct keys from colliding
                return record[0] + "|" + record[3] + "|" + record[4];
            }
        });

        // mapping each "value" (all records matching the key) to a result
        // note: Spark does not guarantee the order of records within a group; sort here if order matters
        final JavaRDD<String[]> result = byObjectAndDate.mapValues(new Function<Iterable<String[]>, String[]>() {
            @Override
            public String[] call(Iterable<String[]> records) throws Exception {
                final Iterator<String[]> iterator = records.iterator();
                String[] previousRecord = iterator.next();
                int diffMinutes = 0;

                // continue with the same iterator so the first record is not visited twice
                while (iterator.hasNext()) {
                    final String[] record = iterator.next();
                    if (record[2].equals("X")) {  // if I got your intention right...
                        final LocalDateTime prev = getLocalDateTime(previousRecord);
                        final LocalDateTime curr = getLocalDateTime(record);
                        diffMinutes += Period.fieldDifference(prev, curr).toStandardMinutes().getMinutes();
                    }
                    previousRecord = record;
                }

                return new String[]{
                        previousRecord[0],
                        Integer.toString(diffMinutes),
                        previousRecord[3],
                        previousRecord[4]
                };
            }
        }).values();

        // do whatever with "result"...
    }

    // extracts a Joda LocalDateTime from a "record"
    static LocalDateTime getLocalDateTime(String[] record) {
        return LocalDateTime.parse(record[0] + " " + record[1], formatter);
    }

    static final DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm");
}
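
To check the per-group logic without a Spark cluster, here is a minimal plain-Java sketch of the same idea. It is not the code from the answer above: it swaps Joda-Time for java.time, replaces the RDD with an in-memory list, and sorts each group by timestamp first, since Spark does not guarantee the order of records inside a group. The class name MinuteGapSketch and its helper methods are hypothetical names chosen for illustration, and the rule "add the gap before every X record" is the same guess the answer makes. On the sample rows this rule gives 8 and 0 minutes rather than the 9 and 1 shown in the question, so the flag condition is the part to adapt.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration; not part of the original answer.
final class MinuteGapSketch {

    private static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm");

    // Parses the date and time columns of a record into a LocalDateTime.
    static LocalDateTime timestampOf(String[] record) {
        return LocalDateTime.parse(record[0] + " " + record[1], FORMATTER);
    }

    // Builds the grouping key from date, object and region.
    static String keyOf(String[] record) {
        return record[0] + "|" + record[3] + "|" + record[4];
    }

    // Sums the gap in minutes that precedes every record flagged "X" (assumed rule).
    static long minutesForGroup(List<String[]> group) {
        List<String[]> sorted = new ArrayList<>(group);
        sorted.sort(Comparator.comparing(MinuteGapSketch::timestampOf));
        long minutes = 0;
        for (int i = 1; i < sorted.size(); i++) {
            if (sorted.get(i)[2].equals("X")) {
                minutes += Duration.between(
                        timestampOf(sorted.get(i - 1)),
                        timestampOf(sorted.get(i))).toMinutes();
            }
        }
        return minutes;
    }

    // Groups records by key and maps each group to {date, minutes, object, region}.
    static List<String[]> summarize(List<String[]> records) {
        Map<String, List<String[]>> groups = new LinkedHashMap<>();
        for (String[] record : records) {
            groups.computeIfAbsent(keyOf(record), k -> new ArrayList<>()).add(record);
        }
        List<String[]> result = new ArrayList<>();
        for (List<String[]> group : groups.values()) {
            String[] first = group.get(0);
            result.add(new String[]{
                    first[0], Long.toString(minutesForGroup(group)), first[3], first[4]});
        }
        return result;
    }

    // The sample rows from the question.
    static List<String[]> sampleRecords() {
        return Arrays.asList(
                new String[]{"2016-03-28", "11:00", "X", "object1", "region1"},
                new String[]{"2016-03-28", "11:01", "Y", "object1", "region1"},
                new String[]{"2016-03-28", "11:05", "X", "object1", "region1"},
                new String[]{"2016-03-28", "11:09", "X", "object1", "region1"},
                new String[]{"2016-03-28", "11:00", "X", "object2", "region1"},
                new String[]{"2016-03-28", "11:01", "Z", "object2", "region1"});
    }

    public static void main(String[] args) {
        for (String[] row : summarize(sampleRecords())) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Once minutesForGroup matches the intended flag rule, the same body can be moved into the function passed to mapValues.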
PS In Scala this would take about 8 lines... :/