
Calculate number of rows in a window and the size of window in bytes - Spark Streaming

I would like to use Spark to count the number of rows I receive in a time window, together with the total number of bytes in those rows, at the end of each window.

However, my code computes these values for each line separately rather than for the whole window. Can someone tell me what is wrong with my code?

public class SocketDriver implements Serializable {

private static final Pattern BACKSLASH = Pattern.compile("\n");

public static void main(String[] args) throws Exception {

    if (args.length < 2) {
        System.err.println("Usage: SocketDriver <hostname> <port>");
        System.exit(1);
    }

    final String hostname = args[0];
    final int port = Integer.parseInt(args[1]);

    final String appName = "SocketDriver";
    final String master = "local[2]";

    final Duration batchDuration = Durations.seconds(1);
    final Duration windowDuration = Durations.seconds(30);
    final Duration slideDuration = Durations.seconds(3);
    final String checkpointDirectory = Files.createTempDirectory(appName).toString();


    SparkConf sparkConf = new SparkConf()
                                    .setAppName(appName)
                                    .setMaster(master);

    JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, batchDuration);
    streamingContext.checkpoint(checkpointDirectory);

    JavaReceiverInputDStream<String> lines = streamingContext.socketTextStream(hostname, port, StorageLevels.MEMORY_AND_DISK_SER);

    JavaDStream<String> words = lines.flatMap(word -> Arrays.asList(BACKSLASH.split(word)).iterator());

    words.window(windowDuration, slideDuration).foreachRDD((VoidFunction<JavaRDD<String>>)
            rdd -> rdd.foreach((VoidFunction<String>)
                    line -> {
                        double bytes = 0;
                        int sum = 0;
                        double frequency = 0.0;
                        sum += 1;
                        bytes += line.getBytes().length;
                        frequency += bytes / sum;

                        System.out.println("windowDuration: " + windowDuration.milliseconds() / 1000 + " seconds " + " : " + "slideDuration: " + slideDuration.milliseconds() / 1000 + " seconds " + " : " +
                                "total messages : " + sum + " total bytes : " + bytes + " frequency : " + frequency);
                    })
    );

    words.countByWindow(windowDuration, slideDuration).print();



    streamingContext.start();
    streamingContext.awaitTerminationOrTimeout(60000);
    streamingContext.stop();
 }


}

The problem lies in the first of the following two statements:

  1. words.window(windowDuration, slideDuration).foreachRDD...

  2. words.countByWindow(windowDuration, slideDuration).print();

The variables sum, bytes and frequency are declared inside the lambda that runs once per line, so they are reset for every line. As a result, sum is always 1 and bytes is the size of that single line, which is the per-line output described in the question.
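The reset can be reproduced without Spark. The following is a minimal plain-Java sketch, not the code from the question: the class name ResetDemo and both method names are made up for illustration, and an ordinary loop stands in for rdd.foreach.

```java
import java.util.Arrays;
import java.util.List;

public class ResetDemo {
    // Mirrors the question's lambda: the counter is recreated for every line,
    // so the value seen after each line is always 1.
    static int lastSumWithReset(List<String> lines) {
        int last = 0;
        for (String line : lines) {
            int sum = 0; // reset on every iteration
            sum += 1;
            last = sum;
        }
        return last;
    }

    // Counter declared once, outside the per-line body, so it accumulates.
    static int sumAccumulated(List<String> lines) {
        int sum = 0;
        for (String line : lines) {
            sum += 1;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "bb", "ccc");
        System.out.println("with reset: " + lastSumWithReset(lines));
        System.out.println("accumulated: " + sumAccumulated(lines));
    }
}
```

In Spark itself, moving the variables outside the lambda is not enough: the function passed to rdd.foreach is serialized and run on the executors, so plain driver-side variables are not updated (and Java requires captured locals to be effectively final). The totals have to come from an aggregation over the window instead.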

You can achieve the desired functionality by replacing the above two statements with the following:

// Imports used by this snippet (place at the top of the file):
// import org.apache.spark.api.java.function.Function2;
// import org.apache.spark.api.java.function.PairFunction;
// import org.apache.spark.streaming.api.java.JavaDStream;
// import org.apache.spark.streaming.api.java.JavaPairDStream;
// import scala.Tuple2;

// counts has elements of the form (1, numberOfBytesInALine)
JavaPairDStream<Integer, Integer> counts = words.mapToPair(new PairFunction<String, Integer, Integer>() {
    @Override
    public Tuple2<Integer, Integer> call(final String line) {
        return new Tuple2<Integer, Integer>(1, line.getBytes().length);
    }
});

// countOfWindow has a single element of the form (totalNumberOfLines, totalNumberOfBytes)
JavaDStream<Tuple2<Integer, Integer>> countOfWindow = counts.reduceByWindow(
    new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
        @Override
        public Tuple2<Integer, Integer> call(final Tuple2<Integer, Integer> a, final Tuple2<Integer, Integer> b) {
            // Add the line counts and the byte counts element-wise
            return new Tuple2<Integer, Integer>(a._1() + b._1(), a._2() + b._2());
        }
    },
    windowDuration, slideDuration);
countOfWindow.print();

The trick is to convert each line to the pair (1, number of bytes in that line). When these pairs are reduced over the window, the 1s add up to the number of lines and the per-line byte counts add up to the total number of bytes.
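The same pair-and-reduce idea can be tried on an ordinary list, without a streaming context. This is only a sketch under assumptions: the class WindowTotals and the method totals are hypothetical names, a List stands in for the contents of one window, and UTF-8 is chosen explicitly, whereas the answer's line.getBytes() uses the platform default charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class WindowTotals {
    // Maps each line to {1, byteLength} and sums the pairs element-wise,
    // yielding {totalNumberOfLines, totalNumberOfBytes}.
    static long[] totals(List<String> lines) {
        return lines.stream()
                .map(line -> new long[] {1L, line.getBytes(StandardCharsets.UTF_8).length})
                .reduce(new long[] {0L, 0L},
                        (a, b) -> new long[] {a[0] + b[0], a[1] + b[1]});
    }

    public static void main(String[] args) {
        // Hypothetical window contents: 5 + 5 + 10 bytes
        List<String> window = Arrays.asList("hello", "spark", "streaming!");
        long[] t = totals(window);
        System.out.println("total messages: " + t[0] + ", total bytes: " + t[1]);
        // prints "total messages: 3, total bytes: 20"
    }
}
```

Because the reduce function is associative and commutative, the pairs can be combined in any order, which is what allows Spark to apply the same function across the partitions and batches of a window.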
