Calculate number of rows in a window and the size of window in bytes - Spark Streaming
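I have a problem: using Spark, I want to count the number of rows received during a time window, together with the total number of bytes in those rows, at the end of each window.
However, my code only counts within each individual line rather than across the whole window. Can someone tell me what is wrong with my code?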
public class SocketDriver implements Serializable {
private static final Pattern BACKSLASH = Pattern.compile("\n");
public static void main(String[] args) throws Exception {
if (args.length < 2) {
System.err.println("Usage: SocketDriver <hostname> <port>");
System.exit(1);
}
final String hostname = args[0];
final int port = Integer.parseInt(args[1]);
final String appName = "SocketDriver";
final String master = "local[2]";
final Duration batchDuration = Durations.seconds(1);
final Duration windowDuration = Durations.seconds(30);
final Duration slideDuration = Durations.seconds(3);
final String checkpointDirectory = Files.createTempDirectory(appName).toString();
SparkConf sparkConf = new SparkConf()
.setAppName(appName)
.setMaster(master);
JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, batchDuration);
streamingContext.checkpoint(checkpointDirectory);
JavaReceiverInputDStream<String> lines = streamingContext.socketTextStream(hostname, port, StorageLevels.MEMORY_AND_DISK_SER);
JavaDStream<String> words = lines.flatMap(word -> Arrays.asList(BACKSLASH.split(word)).iterator());
words.window(windowDuration, slideDuration).foreachRDD((VoidFunction<JavaRDD<String>>)
rdd -> rdd.foreach((VoidFunction<String>)
line -> {
double bytes = 0;
int sum = 0;
double frequency = 0.0;
sum += 1;
bytes += line.getBytes().length;
frequency += bytes / sum;
System.out.println("windowDuration: " + windowDuration.milliseconds() / 1000 + " seconds " + " : " + "slideDuration: " + slideDuration.milliseconds() / 1000 + " seconds " + " : " +
"total messages : " + sum + " total bytes : " + bytes + " frequency : " + frequency);
})
);
words.countByWindow(windowDuration, slideDuration).print();
streamingContext.start();
streamingContext.awaitTerminationOrTimeout(60000);
streamingContext.stop();
}
}
The problem lies in the first of the following two statements:
words.window(windowDuration, slideDuration).foreachRDD...
words.countByWindow(windowDuration, slideDuration).print();
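The problem is that you reset the running totals for every line: sum, bytes, and frequency are declared inside the per-line lambda, so each call starts again from zero. That is why the output shows the byte count of a single line, as described in the question.
You can get the desired behavior by replacing the two statements above with the following: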
//counts will have elements of the form (1, numberOfBytesInALine)
JavaPairDStream<Integer, Integer> counts = words.mapToPair(new PairFunction<String, Integer, Integer>() {
@Override
public Tuple2<Integer, Integer> call(final String line) {
return new Tuple2<Integer, Integer>(1, line.getBytes().length);
}
});
//countOfWindow will have single element of the form (totalNumberOfLines, totalNumberOfBytes)
JavaDStream<Tuple2<Integer, Integer>> countOfWindow = counts.reduceByWindow(new Function2<Tuple2<Integer, Integer>,Tuple2<Integer, Integer>, Tuple2<Integer, Integer>> () {
@Override
public Tuple2<Integer, Integer> call(final Tuple2<Integer, Integer> a , final Tuple2<Integer, Integer> b) {
return new Tuple2<Integer, Integer>(a._1 + b._1, a._2 + b._2);
}
}
,windowDuration,slideDuration);
countOfWindow.print();
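The trick is to convert each line into a pair made of the integer 1 and the number of bytes in that line. When the pairs are then reduced, the 1s add up to the number of lines, and the per-line byte counts add up to the total number of bytes.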
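The map-then-reduce idea can be checked without a running Spark job. The following is a minimal sketch in plain Java, not Spark code: the class name `WindowTotals` and its methods are hypothetical, a `List<String>` stands in for the lines of one window, and UTF-8 is assumed for the byte count (the original code uses the platform default charset via `line.getBytes()`). It also derives the average bytes per line, which appears to be what the question calls "frequency".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: mimics the (1, bytes) pairing and pairwise reduction
// on an in-memory list instead of a windowed DStream.
class WindowTotals {

    // Returns {totalNumberOfLines, totalNumberOfBytes} for the given lines.
    static int[] totals(List<String> lines) {
        return lines.stream()
                // map step: each line becomes the pair (1, bytesInLine)
                .map(line -> new int[] {1, line.getBytes(StandardCharsets.UTF_8).length})
                // reduce step: add the pairs component-wise, starting from (0, 0)
                .reduce(new int[] {0, 0}, (a, b) -> new int[] {a[0] + b[0], a[1] + b[1]});
    }

    // Average bytes per line; returns 0.0 for an empty window to avoid dividing by zero.
    static double averageBytes(int[] totals) {
        return totals[0] == 0 ? 0.0 : (double) totals[1] / totals[0];
    }

    public static void main(String[] args) {
        int[] t = totals(Arrays.asList("ab", "cde", ""));
        System.out.println("total messages : " + t[0]
                + " total bytes : " + t[1]
                + " frequency : " + averageBytes(t));
    }
}
```

In the streaming job, the same division could be applied to the single (totalNumberOfLines, totalNumberOfBytes) element that countOfWindow produces for each window, so the average is computed once per window instead of once per line.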