Unable to write data from TCP port to HDFS using Spark Streaming

I am trying to stream data from a TCP port and load it into HDFS using Spark Streaming.

The files are being created in HDFS, but they are all empty. However, the Spark Streaming console shows that data is being read from the TCP port.

I tried this with the Scala shell on Spark 0.9.0, 0.9.1 and 1.0 on CDH 5. I ran "nc -lk 9993" in another terminal to send the data.

Below is the code. Please let me know how to fix this. Thanks.

import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

val ssc8 = new StreamingContext("local", "NetworkWordCount", Seconds(1))
val lines8 = ssc8.socketTextStream("localhost", 9993)

val words8 = lines8.flatMap(_.split(" "))


val pairs8 = words8.map(word => (word, 1))
val wordCounts8 = pairs8.reduceByKey(_ + _)

wordCounts8.saveAsTextFiles("hdfs://Node1:8020/user/root/Spark8")

wordCounts8.print()

ssc8.start() 

Update ---------------------------------------

I have included the logs and the HDFS files below:

HDFS  Output Files
--------------------

-rw-r--r--   3 user1 user1          0 2014-06-26 09:19 /user/user1/SparkV/_SUCCESS
-rw-r--r--   3 user1 user1          0 2014-06-26 09:19 /user/user1/SparkV/part-00000
-rw-r--r--   3 user1 user1          0 2014-06-26 09:19 /user/user1/SparkV/part-00001




Spark-Shell Console Log
---------------------


-------------------------------------------
Time: 1403789836000 ms
-------------------------------------------
(f,3)
(fsd,2)
(sdf,2)
(fds,1)
(sd,3)

14/06/26 09:37:16 INFO scheduler.JobScheduler: Finished job streaming job 1403789836000 ms.1 from job set of time 1403789836000 ms
14/06/26 09:37:16 INFO storage.MemoryStore: ensureFreeSpace(8) called with curMem=327, maxMem=286339891
14/06/26 09:37:16 INFO storage.MemoryStore: Block input-0-1403789836000 stored as bytes to memory (size 8.0 B, free 273.1 MB)
14/06/26 09:37:16 INFO storage.BlockManagerInfo: Added input-0-1403789836000 in memory on localhost:49784 (size: 8.0 B, free: 273.1 MB)
14/06/26 09:37:16 INFO storage.BlockManagerMaster: Updated info of block input-0-1403789836000
14/06/26 09:37:16 WARN storage.BlockManager: Block input-0-1403789836000 already exists on this machine; not re-adding it
14/06/26 09:37:16 INFO receiver.BlockGenerator: Pushed block input-0-1403789836000
14/06/26 09:37:16 INFO storage.MemoryStore: ensureFreeSpace(15) called with curMem=335, maxMem=286339891
14/06/26 09:37:16 INFO storage.MemoryStore: Block input-0-1403789836200 stored as bytes to memory (size 15.0 B, free 273.1 MB)
14/06/26 09:37:16 INFO storage.BlockManagerInfo: Added input-0-1403789836200 in memory on localhost:49784 (size: 15.0 B, free: 273.1 MB)
14/06/26 09:37:16 INFO storage.BlockManagerMaster: Updated info of block input-0-1403789836200
14/06/26 09:37:16 WARN storage.BlockManager: Block input-0-1403789836200 already exists on this machine; not re-adding it
14/06/26 09:37:16 INFO receiver.BlockGenerator: Pushed block input-0-1403789836200
14/06/26 09:37:16 INFO storage.MemoryStore: ensureFreeSpace(8) called with curMem=350, maxMem=286339891
14/06/26 09:37:16 INFO storage.MemoryStore: Block input-0-1403789836400 stored as bytes to memory (size 8.0 B, free 273.1 MB)
14/06/26 09:37:16 INFO storage.BlockManagerInfo: Added input-0-1403789836400 in memory on localhost:49784 (size: 8.0 B, free: 273.1 MB)
14/06/26 09:37:16 INFO storage.BlockManagerMaster: Updated info of block input-0-1403789836400
14/06/26 09:37:16 WARN storage.BlockManager: Block input-0-1403789836400 already exists on this machine; not re-adding it
14/06/26 09:37:16 INFO receiver.BlockGenerator: Pushed block input-0-1403789836400
14/06/26 09:37:16 INFO storage.MemoryStore: ensureFreeSpace(9) called with curMem=358, maxMem=286339891
14/06/26 09:37:16 INFO storage.MemoryStore: Block input-0-1403789836600 stored as bytes to memory (size 9.0 B, free 273.1 MB)
14/06/26 09:37:16 INFO storage.BlockManagerInfo: Added input-0-1403789836600 in memory on localhost:49784 (size: 9.0 B, free: 273.1 MB)
14/06/26 09:37:16 INFO storage.BlockManagerMaster: Updated info of block input-0-1403789836600
14/06/26 09:37:16 WARN storage.BlockManager: Block input-0-1403789836600 already exists on this machine; not re-adding it
14/06/26 09:37:16 INFO receiver.BlockGenerator: Pushed block input-0-1403789836600
14/06/26 09:37:17 INFO storage.MemoryStore: ensureFreeSpace(14) called with curMem=367, maxMem=286339891
14/06/26 09:37:17 INFO storage.MemoryStore: Block input-0-1403789836800 stored as bytes to memory (size 14.0 B, free 273.1 MB)
14/06/26 09:37:17 INFO storage.BlockManagerInfo: Added input-0-1403789836800 in memory on localhost:49784 (size: 14.0 B, free: 273.1 MB)
14/06/26 09:37:17 INFO storage.BlockManagerMaster: Updated info of block input-0-1403789836800
14/06/26 09:37:17 WARN storage.BlockManager: Block input-0-1403789836800 already exists on this machine; not re-adding it
14/06/26 09:37:17 INFO receiver.BlockGenerator: Pushed block input-0-1403789836800
14/06/26 09:37:18 INFO scheduler.ReceiverTracker: Stream 0 received 6 blocks
14/06/26 09:37:18 INFO scheduler.JobScheduler: Added jobs for time 1403789838000 ms

At first glance, my guess would be that you should try local[4] rather than just local, so that Spark can schedule more tasks.
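
As a minimal sketch of that change (the app name, port and batch interval are copied from the question; the choice of 4 threads is only an example, anything above the number of receivers should do):

import org.apache.spark.streaming._

// "local" gives Spark a single thread, and the long-running socket receiver
// occupies it, so the jobs that should write the batches to HDFS never run.
// "local[4]" leaves threads free for the actual processing.
val ssc8 = new StreamingContext("local[4]", "NetworkWordCount", Seconds(1))
val lines8 = ssc8.socketTextStream("localhost", 9993)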

Try

wordCounts8.saveAsTextFiles("hdfs://Node1:8020/user/root/Spark8", "log")

========== or

append some timestamp after "Spark8":

wordCounts8.saveAsTextFiles("hdfs://Node1:8020/user/root/Spark8" + System.currentTimeMillis().toString())

=========== This works for me on Spark 1.3; see if it works for you.
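
For reference, saveAsTextFiles(prefix, suffix) writes a new output directory for every batch interval, named from the prefix, the batch time in milliseconds and the suffix, so with the first variant you would expect directories roughly like the following (timestamps are only illustrative):

hdfs://Node1:8020/user/root/Spark8-1403789836000.log/_SUCCESS
hdfs://Node1:8020/user/root/Spark8-1403789836000.log/part-00000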

I had the same problem.

Try running

hadoop fs -cat hdfs://Node1:8020/user/root/Spark8

(The hadoop command may look different for you; I had to invoke it as /a/bin/hadoop, but that is specific to your setup.)

and see whether it returns:

cat: `hdfs://Node1:8020/user/root/Spark8': Is a directory

If it does, then, as you said in the comments, you should be able to see the _SUCCESS file and some part-* files inside that directory.

That is where my problem ended. However, it seems there is a further issue with writing to HDFS in your case.

As for why the files are still empty, I would suggest switching to Spark 1.4.0, as it may work better with CDH 5.4. Also, if you run into HDFS permission problems, you have to run

hadoop dfs -chmod -R 0777 /your_hdfs_folder

in order to have write permission.
