Calculate number of rows in a window and the size of window in bytes - Spark Streaming
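I have a problem: I want to use Spark to count the number of lines received over a time window, and the total number of bytes in those lines, at the end of each window.
However, my code only counts per line, not over the whole window. Can someone tell me what is wrong with my code?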
import java.io.Serializable;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.regex.Pattern;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SocketDriver implements Serializable {
    private static final Pattern BACKSLASH = Pattern.compile("\n");

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: SocketDriver <hostname> <port>");
            System.exit(1);
        }
        final String hostname = args[0];
        final int port = Integer.parseInt(args[1]);
        final String appName = "SocketDriver";
        final String master = "local[2]";
        final Duration batchDuration = Durations.seconds(1);
        final Duration windowDuration = Durations.seconds(30);
        final Duration slideDuration = Durations.seconds(3);
        final String checkpointDirectory = Files.createTempDirectory(appName).toString();

        SparkConf sparkConf = new SparkConf()
                .setAppName(appName)
                .setMaster(master);
        JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, batchDuration);
        streamingContext.checkpoint(checkpointDirectory);

        JavaReceiverInputDStream<String> lines = streamingContext.socketTextStream(hostname, port, StorageLevels.MEMORY_AND_DISK_SER);
        JavaDStream<String> words = lines.flatMap(word -> Arrays.asList(BACKSLASH.split(word)).iterator());

        words.window(windowDuration, slideDuration).foreachRDD((VoidFunction<JavaRDD<String>>)
                rdd -> rdd.foreach((VoidFunction<String>)
                        line -> {
                            double bytes = 0;
                            int sum = 0;
                            double frequency = 0.0;
                            sum += 1;
                            bytes += line.getBytes().length;
                            frequency += bytes / sum;
                            System.out.println("windowDuration: " + windowDuration.milliseconds() / 1000 + " seconds " + " : " + "slideDuration: " + slideDuration.milliseconds() / 1000 + " seconds " + " : " +
                                    "total messages : " + sum + " total bytes : " + bytes + " frequency : " + frequency);
                        })
        );
        words.countByWindow(windowDuration, slideDuration).print();

        streamingContext.start();
        streamingContext.awaitTerminationOrTimeout(60000);
        streamingContext.stop();
    }
}
The problem is with the first of the following two statements:
words.window(windowDuration, slideDuration).foreachRDD...
words.countByWindow(windowDuration, slideDuration).print();
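The issue is that sum, bytes and frequency are declared inside the per-line function, so they are reset for every line. As a result, the printed byte count is that of a single line, which is what the question describes.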
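To see the effect in isolation, here is a minimal plain-Java sketch with no Spark involved. The class name PerLineResetDemo, the helper perLineTotals and the sample lines are made up for illustration, and UTF-8 is assumed as the encoding (the question's code uses the platform default).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class PerLineResetDemo {
    // Mirrors the body of the per-line lambda in the question: the counters
    // are local variables, so every call starts again from zero.
    static long[] perLineTotals(String line) {
        long sum = 0;
        long bytes = 0;
        sum += 1;
        bytes += line.getBytes(StandardCharsets.UTF_8).length;
        return new long[] {sum, bytes};
    }

    public static void main(String[] args) {
        // Hypothetical contents of one window.
        List<String> window = Arrays.asList("ab", "cde", "f");
        for (String line : window) {
            long[] t = perLineTotals(line);
            // sum is always 1 and bytes is always the size of this one line
            System.out.println("total messages : " + t[0] + " total bytes : " + t[1]);
        }
    }
}
```

Each printed row reports a single message, never a total for the window, which matches the behavior described in the question.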
You can get the desired behavior by replacing the window(...).foreachRDD and countByWindow statements with the following:
// Needs these imports in addition to the ones already used:
// org.apache.spark.api.java.function.Function2, org.apache.spark.api.java.function.PairFunction,
// org.apache.spark.streaming.api.java.JavaPairDStream, scala.Tuple2

// counts has elements of the form (1, numberOfBytesInALine)
JavaPairDStream<Integer, Integer> counts = words.mapToPair(new PairFunction<String, Integer, Integer>() {
    @Override
    public Tuple2<Integer, Integer> call(final String line) {
        return new Tuple2<Integer, Integer>(1, line.getBytes().length);
    }
});

// countOfWindow has a single element of the form (totalNumberOfLines, totalNumberOfBytes)
JavaDStream<Tuple2<Integer, Integer>> countOfWindow = counts.reduceByWindow(
    new Function2<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
        @Override
        public Tuple2<Integer, Integer> call(final Tuple2<Integer, Integer> a, final Tuple2<Integer, Integer> b) {
            return new Tuple2<Integer, Integer>(a._1() + b._1(), a._2() + b._2());
        }
    },
    windowDuration, slideDuration);

countOfWindow.print();
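The trick is to convert each line into a pair of the integer 1 and the number of bytes in that line. When the pairs are then reduced, the 1s add up to the number of lines and the per-line byte counts add up to the total number of bytes.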
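The same map-then-reduce idea can be tried without a Spark cluster. The sketch below uses plain Java streams on an in-memory list; the class name WindowTotalsSketch, the method totals and the sample lines are illustrative assumptions, it uses long counters instead of Integer, and it assumes UTF-8 rather than the platform default encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class WindowTotalsSketch {
    // Returns {lineCount, byteCount} for the given lines.
    static long[] totals(List<String> lines) {
        return lines.stream()
                // map step: each line becomes the pair (1, bytesInLine)
                .map(line -> new long[] {1L, line.getBytes(StandardCharsets.UTF_8).length})
                // reduce step: add the pairs component by component
                .reduce(new long[] {0L, 0L},
                        (a, b) -> new long[] {a[0] + b[0], a[1] + b[1]});
    }

    public static void main(String[] args) {
        // Hypothetical contents of one window.
        long[] t = totals(Arrays.asList("ab", "cde", "f"));
        // prints "total messages : 3 total bytes : 6"
        System.out.println("total messages : " + t[0] + " total bytes : " + t[1]);
    }
}
```

Because the reduce function is associative, the pairs can be combined in any grouping, which is what allows the windowed reduction to run across partitions and batches.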