
How can I make Spark Streaming count the words in a file in a unit test?

I've successfully built a very simple Spark Streaming application in Java that is based on the HdfsCount example in Scala.

When I submit this application to my local Spark, it waits for a file to be written to a given directory, and when I create that file it successfully prints the number of words. I terminate the application by pressing Ctrl+C.

Now I've tried to create a very basic unit test for this functionality, but in the test I was not able to print the same information, that is, the number of words.

What am I missing?

Below is the unit test file, and after that I've also included the code snippet that shows the countWords method:

StarterAppTest.java

import com.google.common.io.Files;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;


import org.junit.*;

import java.io.*;

public class StarterAppTest {

  JavaStreamingContext ssc;
  File tempDir;

  @Before
  public void setUp() {
    // Streaming context with a 3000 ms batch interval on a single local core
    ssc = new JavaStreamingContext("local", "test", new Duration(3000));
    tempDir = Files.createTempDir();
    tempDir.deleteOnExit();
  }

  @After
  public void tearDown() {
    ssc.stop();
    ssc = null;
  }

  @Test
  public void testInitialization() {
    Assert.assertNotNull(ssc.sc());
  }


  @Test
  public void testCountWords() {

    StarterApp starterApp = new StarterApp();

    try {
      // Watch the temp directory for new files
      JavaDStream<String> lines = ssc.textFileStream(tempDir.getAbsolutePath());
      JavaPairDStream<String, Integer> wordCounts = starterApp.countWords(lines);

      ssc.start();

      // Create a file in the watched directory so the stream picks it up
      File tmpFile = new File(tempDir.getAbsolutePath(), "tmp.txt");
      PrintWriter writer = new PrintWriter(tmpFile, "UTF-8");
      writer.println("8-Dec-2014: Emre Emre Emre Ergin Ergin Ergin");
      writer.close();

      System.err.println("===== Word Counts =======");
      wordCounts.print();
      System.err.println("===== Word Counts =======");

    } catch (FileNotFoundException e) {
      e.printStackTrace();
    } catch (UnsupportedEncodingException e) {
      e.printStackTrace();
    }


    Assert.assertTrue(true);

  }

}

This test compiles and starts to run; Spark Streaming prints a lot of diagnostic messages on the console, but the call to wordCounts.print() does not print anything, whereas in StarterApp.java itself it does.

I've also tried adding ssc.awaitTermination(); after ssc.start(), but nothing changed in that respect. After that I also tried to create a new file manually in the directory that this Spark Streaming application was checking, but this time it gave an error.

For completeness, below is the countWords method:

public JavaPairDStream<String, Integer> countWords(JavaDStream<String> lines) {
    // Split each line on the SPACE pattern into individual words
    JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String x) { return Lists.newArrayList(SPACE.split(x)); }
    });

    // Pair each word with a count of 1, then sum the counts per word
    JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
            new PairFunction<String, String, Integer>() {
              @Override
              public Tuple2<String, Integer> call(String s) { return new Tuple2<>(s, 1); }
            }).reduceByKey((i1, i2) -> i1 + i2);

    return wordCounts;
  }

A few pointers:

  • Give at least 2 cores to the Spark Streaming context: 1 for the streaming side and 1 for the Spark processing, i.e. "local" -> "local[2]".
  • Your streaming interval is 3000 ms, so somewhere in your program you need to wait at least that long to expect an output.
  • Spark Streaming needs some time to set up its listeners, and the file is being created immediately after ssc.start() is issued, so there's no guarantee that the filesystem listener is already in place. I'd do some sleep(xx) after ssc.start() (see the sketch below this list).
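
Putting these three points together, below is a minimal sketch of the test body. It assumes the context is created inside the test itself (rather than in setUp) and uses Thread.sleep purely for illustration; StarterApp and countWords are the names from the question:

  @Test
  public void testCountWords() throws Exception {
    // At least 2 cores: "local[2]" instead of "local"
    ssc = new JavaStreamingContext("local[2]", "test", new Duration(3000));

    JavaDStream<String> lines = ssc.textFileStream(tempDir.getAbsolutePath());
    JavaPairDStream<String, Integer> wordCounts = new StarterApp().countWords(lines);
    wordCounts.print(); // register the output operation before starting the context

    ssc.start();
    Thread.sleep(1000); // give the filesystem listener some time to get in place

    File tmpFile = new File(tempDir.getAbsolutePath(), "tmp.txt");
    PrintWriter writer = new PrintWriter(tmpFile, "UTF-8");
    writer.println("8-Dec-2014: Emre Emre Emre Ergin Ergin Ergin");
    writer.close();

    // Wait at least one full batch interval (3000 ms) before expecting output
    Thread.sleep(2 * 3000);
  }

A production-quality test would poll for the expected counts with a timeout instead of sleeping for fixed intervals, but the sleeps are enough to demonstrate the timing issue.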

In Streaming, it's all about the right timing.
