
Calculate number of rows in a window and the size of window in bytes - Spark Streaming

I am confronted with a problem: I would like to use Spark to count the number of rows received in a time window, together with the total number of bytes of those rows, at the end of each window.

However, my code only counts per line, not globally over the window. Can someone tell me what is wrong with my code?

import java.io.Serializable;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.regex.Pattern;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SocketDriver implements Serializable {

private static final Pattern BACKSLASH = Pattern.compile("\n");

public static void main(String[] args) throws Exception {

    if (args.length < 2) {
        System.err.println("Usage: SocketDriver <hostname> <port>");
        System.exit(1);
    }

    final String hostname = args[0];
    final int port = Integer.parseInt(args[1]);

    final String appName = "SocketDriver";
    final String master = "local[2]";

    final Duration batchDuration = Durations.seconds(1);
    final Duration windowDuration = Durations.seconds(30);
    final Duration slideDuration = Durations.seconds(3);
    final String checkpointDirectory = Files.createTempDirectory(appName).toString();


    SparkConf sparkConf = new SparkConf()
                                    .setAppName(appName)
                                    .setMaster(master);

    JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, batchDuration);
    streamingContext.checkpoint(checkpointDirectory);

    JavaReceiverInputDStream<String> lines = streamingContext.socketTextStream(hostname, port, StorageLevels.MEMORY_AND_DISK_SER);

    JavaDStream<String> words = lines.flatMap(word -> Arrays.asList(BACKSLASH.split(word)).iterator());

    words.window(windowDuration, slideDuration).foreachRDD((VoidFunction<JavaRDD<String>>)
            rdd -> rdd.foreach((VoidFunction<String>)
                    line -> {
                        double bytes = 0;
                        int sum = 0;
                        double frequency = 0.0;
                        sum += 1;
                        bytes += line.getBytes().length;
                        frequency += bytes / sum;

                        System.out.println("windowDuration: " + windowDuration.milliseconds() / 1000 + " seconds " + " : " + "slideDuration: " + slideDuration.milliseconds() / 1000 + " seconds " + " : " +
                                "total messages : " + sum + " total bytes : " + bytes + " frequency : " + frequency);
                    })
    );

    words.countByWindow(windowDuration, slideDuration).print();



    streamingContext.start();
    streamingContext.awaitTerminationOrTimeout(60000);
    streamingContext.stop();
 }


}

The problem lies in the first of the following two statements:

  1. words.window(windowDuration, slideDuration).foreachRDD...

  2. words.countByWindow(windowDuration, slideDuration).print();

The problem is that you are resetting the sum and byte counters for every line, so you only ever get the number of bytes of a single line, as mentioned in the question.
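To see the scoping issue outside Spark, here is a minimal plain-Java sketch (not Spark code; the class and method names are invented for illustration): locals declared inside the per-line body restart at zero on every line, so `sum` can never exceed 1. Note that in actual Spark code, merely hoisting the variables out of the lambda would not fix it either, because the closure is serialized and executed per task on the executors; the aggregation has to be expressed as a Spark reduction.

```java
import java.util.List;

public class PerLineResetDemo {
    // Mirrors the original foreach body: the counter lives INSIDE the loop,
    // so it is re-initialised for every line (the bug in the question).
    static long lastSumSeen(List<String> lines) {
        long last = 0;
        for (String line : lines) {
            int sum = 0;   // reset on every line
            sum += 1;
            last = sum;    // always 1, no matter how many lines arrive
        }
        return last;
    }

    // The intended behaviour: the counter lives OUTSIDE the loop.
    static long totalLines(List<String> lines) {
        int sum = 0;
        for (String line : lines) {
            sum += 1;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("a", "bb", "ccc");
        System.out.println(lastSumSeen(lines)); // 1
        System.out.println(totalLines(lines));  // 3
    }
}
```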

You can achieve the desired functionality by replacing the above two statements with the following:

//counts will have elements of the form (1, numberOfBytesInALine)    
JavaPairDStream<Integer, Integer> counts = words.mapToPair(new PairFunction<String, Integer, Integer>() {
    @Override
    public Tuple2<Integer, Integer> call(final String line) {
        return new Tuple2<Integer, Integer>(1, line.getBytes().length);
    }
});

//countOfWindow will have single element of the form (totalNumberOfLines, totalNumberOfBytes)
JavaDStream<Tuple2<Integer, Integer>> countOfWindow = counts.reduceByWindow(new Function2<Tuple2<Integer, Integer>,Tuple2<Integer, Integer>, Tuple2<Integer, Integer>> () {
    @Override
    public Tuple2<Integer, Integer> call(final Tuple2<Integer, Integer> a , final Tuple2<Integer, Integer> b) {
        return new Tuple2<Integer, Integer>(a._1 + b._1, a._2 + b._2);
    }
}, windowDuration, slideDuration);
countOfWindow.print();

The trick is to map each line to the pair (1, numberOfBytesInThatLine). When we then reduce, the 1s sum up to the number of lines, while the per-line byte counts sum up to the total number of bytes.
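The same map-then-reduce trick can be sketched in plain Java without Spark (`PairReduceDemo` and `countAndBytes` are hypothetical names for illustration; byte counts use the platform default charset, as `line.getBytes()` does in the answer):

```java
import java.util.List;

public class PairReduceDemo {
    // Folds a list of lines into (totalNumberOfLines, totalNumberOfBytes),
    // mirroring the mapToPair + reduceByWindow combination above.
    static int[] countAndBytes(List<String> lines) {
        int[] acc = {0, 0}; // acc[0] = line count, acc[1] = byte total
        for (String line : lines) {
            // map step: each line becomes the pair (1, bytesInLine)
            int[] pair = {1, line.getBytes().length};
            // reduce step: element-wise sum of the pairs
            acc[0] += pair[0];
            acc[1] += pair[1];
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] totals = countAndBytes(List.of("ab", "cde"));
        System.out.println(totals[0] + " lines, " + totals[1] + " bytes"); // 2 lines, 5 bytes
    }
}
```

Because every element carries a constant 1 in its first slot, any associative pairwise sum (which is what a windowed reduce performs, possibly across partitions) yields the element count for free alongside the byte total.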

Note: the technical posts on this site are licensed under CC BY-SA 4.0; if you repost, please credit the original source.
