
Print the content of streams (Spark streaming) in Windows system

I just want to print the content of a stream to the console. I wrote the following code, but it does not print anything. Can anyone help me read a text file as a stream in Spark? Is there a problem related to the Windows system?

public static void main(String[] args) throws Exception {

    SparkConf sparkConf = new SparkConf().setAppName("My app")
        .setMaster("local[2]")
        .setSparkHome("C:\\Spark\\spark-1.5.1-bin-hadoop2.6")
        .set("spark.executor.memory", "2g");

    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));

    JavaDStream<String> dataStream = jssc.textFileStream("C://testStream//copy.csv");
    dataStream.print();

    jssc.start();
    jssc.awaitTermination();
}

UPDATE: The content of copy.csv is

0,0,12,5,0
0,0,12,5,0
0,1,2,0,42
0,0,0,0,264
0,0,12,5,0

textFileStream is for monitoring Hadoop-compatible directories. This operation watches the provided directory, and as you add new files to it, it reads/streams the data from the newly added files.

You cannot read existing text/CSV files using textFileStream; or rather, I would say that you do not need streaming at all if you are just reading the files.

My suggestion would be to monitor some directory (it may be HDFS or the local file system), then add files to it and capture the content of these new files using textFileStream.

In your code, you could replace "C://testStream//copy.csv" with "C://testStream", and once your Spark Streaming job is up and running, add the file copy.csv to the C://testStream folder and watch the output on the Spark console.
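A minimal sketch of that suggestion, assuming the same paths and the Spark 1.x streaming API from the question (the class name is invented for illustration):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamFolderDemo {
    public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("My app").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));

        // Point at the directory, not a single file; textFileStream only picks up
        // files that appear in this directory after the job has started.
        JavaDStream<String> dataStream = jssc.textFileStream("C://testStream");
        dataStream.print();

        jssc.start();
        jssc.awaitTermination();
    }
}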

OR

Or you could write another command-line Scala/Java program that reads the files and pushes the content over a socket (at a certain PORT#), and then leverage socketTextStream for capturing and reading that data (a sketch follows below). Once you have read the data, you can apply further transformations or output operations.
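A minimal sketch of the receiving side, reusing the jssc from the question and assuming some feeder (your own program, or a tool such as netcat) is already writing lines to localhost:9999 (both the host and the port are just examples):

import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;

// Each line the feeder writes to the socket becomes one element of the stream.
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
lines.print();

jssc.start();
jssc.awaitTermination();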

You could also consider leveraging Flume.

Refer to the API documentation for more details.

This worked for me on Windows 7 and Spark 1.6.3 (leaving out the rest of the code; the important part is how to define the folder to monitor):

val ssc = ...
val lines = ssc.textFileStream("file:///D:/tmp/data")
...
lines.print()
...

This monitors the directory D:/tmp/data; ssc is my streaming context.

Steps:

  1. Create a file, say 1.txt, in D:/tmp/data
  2. Enter some text
  3. Start the Spark application
  4. Rename the file to data.txt (I believe any arbitrary name will do, as long as it is changed while the directory is being monitored by Spark)

One other thing I noticed is that I had to change the line separators to Unix style (using Notepad++); otherwise the file wasn't getting picked up.
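A hedged sketch of that whole workflow in Java (the staging path, file names, and sample lines are made up for illustration): write the file outside the monitored folder using Unix "\n" line endings, then move it in atomically so Spark sees it appear as one complete new file:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class DropFileDemo {
    public static void main(String[] args) throws Exception {
        // Write the file outside the monitored folder, using Unix "\n" line
        // endings to match the observation above.
        Path staging = Paths.get("D:/tmp/staging/1.txt");
        Files.createDirectories(staging.getParent());
        Files.write(staging, "0,0,12,5,0\n0,1,2,0,42\n".getBytes(StandardCharsets.UTF_8));

        // Move it into the monitored folder in one atomic step, so Spark sees a
        // complete new file rather than a half-written one.
        Files.move(staging, Paths.get("D:/tmp/data/1.txt"),
                StandardCopyOption.ATOMIC_MOVE);
    }
}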

Try the following code; it works (it points textFileStream at the directory rather than a single file, and uses an explicit file:/// URI):

JavaDStream<String> dataStream = jssc.textFileStream("file:///C:/testStream/");
