How to calculate difference between current and previous row in Spark JavaRDD

I parsed a .log file into a JavaRDD and sorted it, so now I have, for example, oldJavaRDD:
2016-03-28 | 11:00 | X | object1 | region1
2016-03-28 | 11:01 | Y | object1 | region1
2016-03-28 | 11:05 | X | object1 | region1
2016-03-28 | 11:09 | X | object1 | region1
2016-03-28 | 11:00 | X | object2 | region1
2016-03-28 | 11:01 | Z | object2 | region1

How can I get a newJavaRDD that I can save to a DB?
The new JavaRDD structure has to be:
2016-03-28 | 9 | object1 | region1
2016-03-28 | 1 | object2 | region1
So I have to calculate the time between the current and the previous row (in some cases also using the X, Y, Z flag to decide whether to add that time to the result) and add a new element to the JavaRDD whenever date, objectName or objectRegion changes.

I can do it with this kind of code (map), but I don't think it's good, and it's probably not the fastest way:

    JavaRDD<NewObject> newJavaRDD = oldJavaRDD.map { r -> 
      String datePrev[] = ...
        if (datePrev != dateCurr ...) {
          return newJavaRdd;
        } else {
          return null;
        }
    }

First, your code example references newJavaRDD from within a transformation that creates newJavaRDD - that's impossible on a few different levels:

  • You can't reference a variable on the right-hand-side of that variable's declaration...
  • You can't use an RDD within a transformation on an RDD (the same one or another one, it doesn't matter) - the function passed to a transformation is serialized and run on the executors, while RDD transformations and actions can only be invoked from the driver

So, how should you do that?

Assuming:

  1. Your intention here is to get a single record for each combination of date + object + region
  2. There shouldn't be too many records for each such combination, so it's safe to groupBy these fields as key
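
If assumption 2 does not hold and some keys have very many records, collecting each group in memory may become a problem. One option, sketched here in plain Java with java.time, is to keep only the earliest and latest time per key and combine them with an associative merge function; a function of that shape could in principle be handed to a pairwise reduction such as reduceByKey. Span, merge and minutesPerKey are hypothetical names, and this covers only the first-to-last distance, not flag-dependent rules.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.LinkedHashMap;
import java.util.Map;

class FirstLastPerKey {

    // Hypothetical helper: earliest and latest time seen so far for one key.
    static final class Span {
        final LocalTime first;
        final LocalTime last;

        Span(LocalTime first, LocalTime last) {
            this.first = first;
            this.last = last;
        }

        // Associative and commutative, so partial results can be merged in any order.
        static Span merge(Span a, Span b) {
            LocalTime first = a.first.isBefore(b.first) ? a.first : b.first;
            LocalTime last = a.last.isAfter(b.last) ? a.last : b.last;
            return new Span(first, last);
        }

        long minutes() {
            return Duration.between(first, last).toMinutes();
        }
    }

    // Folds all records into one Span per "date|object|region" key.
    static Map<String, Long> minutesPerKey(String[][] records) {
        Map<String, Span> spans = new LinkedHashMap<>();
        for (String[] r : records) {
            String key = r[0] + "|" + r[3] + "|" + r[4];
            LocalTime t = LocalTime.parse(r[1]);
            spans.merge(key, new Span(t, t), Span::merge);
        }
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Span> e : spans.entrySet()) {
            result.put(e.getKey(), e.getValue().minutes());
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] records = {
            {"2016-03-28", "11:00", "X", "object1", "region1"},
            {"2016-03-28", "11:01", "Y", "object1", "region1"},
            {"2016-03-28", "11:05", "X", "object1", "region1"},
            {"2016-03-28", "11:09", "X", "object1", "region1"},
            {"2016-03-28", "11:00", "X", "object2", "region1"},
            {"2016-03-28", "11:01", "Z", "object2", "region1"}
        };
        // prints {2016-03-28|object1|region1=9, 2016-03-28|object2|region1=1}
        System.out.println(minutesPerKey(records));
    }
}
```

Because merge only ever holds two times per key, memory use does not grow with the number of records in a group.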

You can groupBy the key fields, and then mapValues to get the "minute distance" between the first and last record (the function passed to mapValues can be changed to contain your exact logic if I didn't get it right). I'll use the Joda Time library for the time calculations:

import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.joda.time.LocalDateTime;
import org.joda.time.Period;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

// the class name is arbitrary, it only makes the example compile
public class RowDiffExample {

static final DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm");

public static void main(String[] args) {
    // some setup code for this test:
    JavaSparkContext sc = new JavaSparkContext("local", "test");

    // input:
    final JavaRDD<String[]> input = sc.parallelize(Arrays.asList(
            //              date        time     flag  object     region
            new String[]{"2016-03-28", "11:00", "X", "object1", "region1"},
            new String[]{"2016-03-28", "11:01", "Y", "object1", "region1"},
            new String[]{"2016-03-28", "11:05", "X", "object1", "region1"},
            new String[]{"2016-03-28", "11:09", "X", "object1", "region1"},
            new String[]{"2016-03-28", "11:00", "X", "object2", "region1"},
            new String[]{"2016-03-28", "11:01", "Z", "object2", "region1"}
    ));

    // grouping by key:
    final JavaPairRDD<String, Iterable<String[]>> byObjectAndDate = input.groupBy(new Function<String[], String>() {
        @Override
        public String call(String[] record) throws Exception {
            // date, object, region; the separator keeps e.g. "ab"+"c" and "a"+"bc" apart
            return record[0] + "|" + record[3] + "|" + record[4];
        }
    });

    // mapping each "value" (all records matching a key) to a result
    final JavaRDD<String[]> result = byObjectAndDate.mapValues(new Function<Iterable<String[]>, String[]>() {
        @Override
        public String[] call(Iterable<String[]> records) throws Exception {
            // note: the order of records inside a group is not guaranteed after groupBy;
            // sort them by time here first if your logic depends on it
            final Iterator<String[]> iterator = records.iterator();
            String[] previousRecord = iterator.next(); // groups are never empty
            int diffMinutes = 0;

            // continue with the same iterator, so the first record is not visited twice
            while (iterator.hasNext()) {
                final String[] record = iterator.next();
                if (record[2].equals("X")) {  // if I got your intention right...
                    final LocalDateTime prev = getLocalDateTime(previousRecord);
                    final LocalDateTime curr = getLocalDateTime(record);
                    diffMinutes += Period.fieldDifference(prev, curr).toStandardMinutes().getMinutes();
                }
                previousRecord = record;
            }

            return new String[]{
                    previousRecord[0],
                    Integer.toString(diffMinutes),
                    previousRecord[3],
                    previousRecord[4]
            };
        }
    }).values();

    // do whatever with "result"...

    sc.stop();
}

// extracts a Joda LocalDateTime from a "record"
static LocalDateTime getLocalDateTime(String[] record) {
    return LocalDateTime.parse(record[0] + " " + record[1], formatter);
}

}
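
The per-group rule can be checked without a cluster. The sketch below is plain Java using java.time instead of Joda Time (a swap made here only to avoid dependencies); GapSum, sumGaps and countRow are hypothetical names. It sorts one group by time, since the order of records inside a group after groupBy should not be relied on, and then sums the gaps between consecutive rows. On the sample data, counting every gap gives the 9 and 1 from the question, while counting only gaps that end on an X row gives 8 and 0, so the flag rule is worth adjusting to your exact requirements.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

class GapSum {

    // Sums the minutes between consecutive rows of ONE group.
    // countRow looks at the current row and decides whether its gap
    // to the previous row is added to the total.
    static long sumGaps(List<String[]> group, Predicate<String[]> countRow) {
        List<String[]> sorted = new ArrayList<>(group);
        sorted.sort(Comparator.comparing((String[] r) -> LocalTime.parse(r[1])));

        long total = 0;
        for (int i = 1; i < sorted.size(); i++) {
            String[] prev = sorted.get(i - 1);
            String[] curr = sorted.get(i);
            if (countRow.test(curr)) {
                total += Duration.between(LocalTime.parse(prev[1]), LocalTime.parse(curr[1])).toMinutes();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // deliberately out of order to show that the sort matters
        List<String[]> object1 = Arrays.asList(
                new String[]{"2016-03-28", "11:05", "X", "object1", "region1"},
                new String[]{"2016-03-28", "11:00", "X", "object1", "region1"},
                new String[]{"2016-03-28", "11:09", "X", "object1", "region1"},
                new String[]{"2016-03-28", "11:01", "Y", "object1", "region1"});
        List<String[]> object2 = Arrays.asList(
                new String[]{"2016-03-28", "11:00", "X", "object2", "region1"},
                new String[]{"2016-03-28", "11:01", "Z", "object2", "region1"});

        System.out.println(sumGaps(object1, r -> true));               // prints 9
        System.out.println(sumGaps(object2, r -> true));               // prints 1
        System.out.println(sumGaps(object1, r -> r[2].equals("X")));   // prints 8
        System.out.println(sumGaps(object2, r -> r[2].equals("X")));   // prints 0
    }
}
```

The same loop body could be dropped into the function passed to mapValues once the rule for the flags is settled.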

PS In Scala this would take about 8 lines... :/
