How can I make Spark Streaming count the words in a file in a unit test?
I've successfully built a very simple Spark Streaming application in Java that is based on the HdfsCount example in Scala.
When I submit this application to my local Spark, it waits for a file to be written to a given directory, and when I create that file it successfully prints the number of words. I terminate the application by pressing Ctrl+C.
Now I've tried to create a very basic unit test for this functionality, but in the test I was not able to print the same information, that is, the number of words.
What am I missing?
Below is the unit test file, and after that I've also included the code snippet that shows the countWords method:
import com.google.common.io.Files;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.junit.*;

import java.io.*;

public class StarterAppTest {
    JavaStreamingContext ssc;
    File tempDir;

    @Before
    public void setUp() {
        ssc = new JavaStreamingContext("local", "test", new Duration(3000));
        tempDir = Files.createTempDir();
        tempDir.deleteOnExit();
    }

    @After
    public void tearDown() {
        ssc.stop();
        ssc = null;
    }

    @Test
    public void testInitialization() {
        Assert.assertNotNull(ssc.sc());
    }

    @Test
    public void testCountWords() {
        StarterApp starterApp = new StarterApp();

        try {
            JavaDStream<String> lines = ssc.textFileStream(tempDir.getAbsolutePath());
            JavaPairDStream<String, Integer> wordCounts = starterApp.countWords(lines);
            ssc.start();

            File tmpFile = new File(tempDir.getAbsolutePath(), "tmp.txt");
            PrintWriter writer = new PrintWriter(tmpFile, "UTF-8");
            writer.println("8-Dec-2014: Emre Emre Emre Ergin Ergin Ergin");
            writer.close();

            System.err.println("===== Word Counts =======");
            wordCounts.print();
            System.err.println("===== Word Counts =======");
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }

        Assert.assertTrue(true);
    }
}
This test compiles and starts to run, and Spark Streaming prints a lot of diagnostic messages on the console, but the call to wordCounts.print() does not print anything, whereas in StarterApp.java itself it does.
I've also tried adding ssc.awaitTermination(); after ssc.start(), but nothing changed in that respect. After that I also tried to create a new file manually in the directory that this Spark Streaming application was checking, but this time it gave an error.
For completeness, below is the countWords method:
public JavaPairDStream<String, Integer> countWords(JavaDStream<String> lines) {
    // Split each line on spaces; SPACE is assumed to be a precompiled field
    // such as: private static final Pattern SPACE = Pattern.compile(" ");
    JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String x) { return Lists.newArrayList(SPACE.split(x)); }
    });

    // Map each word to (word, 1), then sum the ones per word
    JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
        new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String s) { return new Tuple2<>(s, 1); }
        }).reduceByKey((i1, i2) -> i1 + i2);

    return wordCounts;
}
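To make clear what countWords computes per batch, here is a plain-Java equivalent (a sketch with no Spark dependency; the class name WordCountDemo and the use of java.util.stream are my own, not from the original code): flatMap corresponds to splitting each line on spaces, and mapToPair/reduceByKey correspond to grouping and counting.

```java
import java.util.*;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class WordCountDemo {
    private static final Pattern SPACE = Pattern.compile(" ");

    // Plain-Java equivalent of countWords for a single batch of lines:
    // flatMap -> split on spaces; mapToPair/reduceByKey -> group and count.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(SPACE.split(line)))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts =
                countWords(List.of("8-Dec-2014: Emre Emre Emre Ergin Ergin Ergin"));
        System.out.println(counts.get("Emre") + " " + counts.get("Ergin"));
        // prints: 3 3
    }
}
```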
Few pointers:

- You create the file right after ssc.start is issued. There's no guarantee that the filesystem listener is already in place when the file is written, so the new file may simply be missed. I would do some sleep(xx) after ssc.start and only then write the file.
- In Streaming, it's all about the right timing.
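The race described above can be illustrated without Spark at all. Rather than a fixed sleep, a common test pattern is to poll for the expected result with a timeout. The eventually helper below is my own hypothetical sketch of that pattern (the AtomicInteger stands in for the word counts the streaming job would eventually produce; the 200 ms delay simulates the listener becoming ready only after start):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class EventuallyDemo {
    // Polls a condition until it holds or the timeout expires.
    // In the unit test above, the condition would inspect the collected word counts.
    static boolean eventually(BooleanSupplier condition, long timeoutMs, long intervalMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) return true;
            Thread.sleep(intervalMs);
        }
        return condition.getAsBoolean();
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate an asynchronous listener that only produces a result after
        // a delay, like the textFileStream watcher after ssc.start().
        AtomicInteger counts = new AtomicInteger(0);
        new Thread(() -> {
            try { Thread.sleep(200); } catch (InterruptedException ignored) {}
            counts.set(6); // e.g. six words counted from tmp.txt
        }).start();

        // An assertion issued immediately would fail; polling succeeds.
        boolean ok = eventually(() -> counts.get() == 6, 2000, 50);
        System.out.println(ok ? "counts arrived" : "timed out");
    }
}
```

The same idea applies to the original test: after ssc.start(), write the file, then wait (with a generous timeout) for the counts to appear instead of asserting immediately.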