Calculate number of rows in a window and the size of window in bytes - Spark Streaming
I want to use Spark Streaming to count the number of rows received in each time window, together with the total number of bytes in those rows, reported at the end of each window.
However, my code only computes these values for each line separately, not for the window as a whole. Can someone tell me what is wrong with my code?
import java.io.Serializable;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.regex.Pattern;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SocketDriver implements Serializable {
    private static final Pattern BACKSLASH = Pattern.compile("\n");

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: SocketDriver <hostname> <port>");
            System.exit(1);
        }
        final String hostname = args[0];
        final int port = Integer.parseInt(args[1]);
        final String appName = "SocketDriver";
        final String master = "local[2]";
        final Duration batchDuration = Durations.seconds(1);
        final Duration windowDuration = Durations.seconds(30);
        final Duration slideDuration = Durations.seconds(3);
        final String checkpointDirectory = Files.createTempDirectory(appName).toString();
        SparkConf sparkConf = new SparkConf()
                .setAppName(appName)
                .setMaster(master);
        JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, batchDuration);
        streamingContext.checkpoint(checkpointDirectory);
        JavaReceiverInputDStream<String> lines = streamingContext.socketTextStream(hostname, port, StorageLevels.MEMORY_AND_DISK_SER);
        JavaDStream<String> words = lines.flatMap(word -> Arrays.asList(BACKSLASH.split(word)).iterator());
        words.window(windowDuration, slideDuration).foreachRDD((VoidFunction<JavaRDD<String>>)
                rdd -> rdd.foreach((VoidFunction<String>)
                        line -> {
                            double bytes = 0;
                            int sum = 0;
                            double frequency = 0.0;
                            sum += 1;
                            bytes += line.getBytes().length;
                            frequency += bytes / sum;
                            System.out.println("windowDuration: " + windowDuration.milliseconds() / 1000 + " seconds " + " : " + "slideDuration: " + slideDuration.milliseconds() / 1000 + " seconds " + " : " +
                                    "total messages : " + sum + " total bytes : " + bytes + " frequency : " + frequency);
                        })
        );
        words.countByWindow(windowDuration, slideDuration).print();
        streamingContext.start();
        streamingContext.awaitTerminationOrTimeout(60000);
        streamingContext.stop();
    }
}
The problem lies in the first of the following two statements:
words.window(windowDuration, slideDuration).foreachRDD...
words.countByWindow(windowDuration, slideDuration).print();
The variables sum and bytes are declared inside the per-line lambda, so they are reset for every line. As a result, the output shows a count of 1 and the byte count of a single line, which is the behaviour described in the question.
You can get the desired result by replacing the two statements above with the following:
// Also requires: scala.Tuple2, org.apache.spark.api.java.function.PairFunction,
// org.apache.spark.api.java.function.Function2, org.apache.spark.streaming.api.java.JavaPairDStream

// counts will have elements of the form (1, numberOfBytesInALine)
JavaPairDStream<Integer, Integer> counts = words.mapToPair(new PairFunction<String, Integer, Integer>() {
    @Override
    public Tuple2<Integer, Integer> call(final String line) {
        return new Tuple2<Integer, Integer>(1, line.getBytes().length);
    }
});
// countOfWindow will have a single element of the form (totalNumberOfLines, totalNumberOfBytes)
JavaDStream<Tuple2<Integer, Integer>> countOfWindow = counts.reduceByWindow(
        new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
            @Override
            public Tuple2<Integer, Integer> call(final Tuple2<Integer, Integer> a, final Tuple2<Integer, Integer> b) {
                return new Tuple2<Integer, Integer>(a._1 + b._1, a._2 + b._2);
            }
        },
        windowDuration, slideDuration);
countOfWindow.print();
The trick is to map each line to a pair consisting of the integer 1 and the number of bytes in that line. When the pairs are reduced over the window, the 1s add up to the number of lines and the per-line byte counts add up to the total number of bytes.
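To see the reduction in isolation, here is a minimal sketch in plain Java with no Spark involved. The class WindowTotals and its methods toPair, add and totals are hypothetical names made up for this illustration; the list passed to totals stands in for the lines of one window. It also assumes UTF-8 when measuring bytes, whereas line.getBytes() with no argument uses the platform default charset, and it uses long rather than Integer for the totals.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class WindowTotals {
    // Hypothetical helper: map one line to the pair (1, bytesInLine).
    // Assumption: bytes are measured in UTF-8.
    public static long[] toPair(String line) {
        return new long[] {1L, line.getBytes(StandardCharsets.UTF_8).length};
    }

    // Combine two pairs component-wise, like the function given to reduceByWindow.
    public static long[] add(long[] a, long[] b) {
        return new long[] {a[0] + b[0], a[1] + b[1]};
    }

    // Reduce all lines of one (simulated) window to (totalLines, totalBytes).
    public static long[] totals(List<String> lines) {
        return lines.stream()
                .map(WindowTotals::toPair)
                .reduce(new long[] {0L, 0L}, WindowTotals::add);
    }

    public static void main(String[] args) {
        List<String> window = Arrays.asList("abc", "de", "fghij");
        long[] t = totals(window);
        // prints "total messages: 3, total bytes: 10"
        System.out.println("total messages: " + t[0] + ", total bytes: " + t[1]);
    }
}
```

The average bytes per line (the "frequency" printed in the question) can then be derived from the final pair as totalBytes / totalLines, as long as the window is not empty.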