Sending data from NodeJS to Apache Spark over a TCP port

Question

I have a NodeJS server that streams data from an API and pushes it to local TCP port 8080, where Apache Spark is listening.

const net = require('net');
const client = new net.Socket();
const axios = require('axios');

client.connect(8080, '127.0.0.1');
client.on('connect', async () => {
  const res = await axios.get('https://api.co.za', {
    responseType: 'stream',
  });
  res.data.on('data', chunk => {
    client.write(chunk);
  });
});

Apache Spark then tries to read the data from that port.

import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.{ SparkConf, SparkContext }

object DataStream {
  def main(args: Array[String]) {
    val sparkConfig = new SparkConf()
      .setAppName("Data Stream")
      .setMaster(sys.env.getOrElse("spark.master", "local[*]"))
    val sparkContext = new SparkContext(sparkConfig)
    sparkContext.setLogLevel("ERROR")

    val streamingContext = new StreamingContext(sparkContext, Seconds(1))

    val data = streamingContext.socketTextStream("127.0.0.1", 8080)
    data.print()

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}

I then open port 8080 with netcat: nc -l 8080

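Here is my problem: if I start the Node process first, it pushes data to the port, but I see no reaction from Spark. If I start Spark first, my Node process reports that it is writing, but I don't see the data arriving on port 8080.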

If I send data directly through netcat after running nc -l 8080, Spark reads it without any problem.

Is there some kind of client exclusivity on these local ports? Is there another way to open a port to be used like this?

OS: Ubuntu 19.10

Answer 1

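I see what you are trying to do, but both of your applications appear to be acting as clients. If I remember correctly, Spark pulls data from the data source, so you need to change your NodeJS code to act as a server. I ended up trying this, and it worked with pySpark: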

const net = require("net");

const server = net.createServer(function (socket) {
  // socketTextStream splits records on "\n", so terminate each line
  socket.write("Write your chunk here.\n");
  socket.end("You can also send a connection treatment string here.");
});

server.on("error", (err) => {
  console.log(`Error in server:\n${err}`);
});

server.listen(8080, "127.0.0.1", () => {
  console.log("Nodejs server ready to respond.");
});
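
To connect this back to the question, here is a minimal sketch of how the API stream might be forwarded through such a server. `createBroadcastServer` and `source` are hypothetical names that are not part of the original answer; `source` stands for any readable stream, such as `res.data` from the axios call in the question. In this sketch, chunks that arrive while no client is connected are simply dropped, and backpressure is not handled.

```javascript
const net = require("net");

// Hypothetical helper: forwards every chunk emitted by `source`
// (any readable stream) to all currently connected sockets.
function createBroadcastServer(source) {
  const clients = new Set();

  // Track each client (for example, Spark) for as long as it stays connected.
  const server = net.createServer((socket) => {
    clients.add(socket);
    socket.on("close", () => clients.delete(socket));
    socket.on("error", () => clients.delete(socket));
  });

  // Push each upstream chunk to every connected client.
  source.on("data", (chunk) => {
    for (const socket of clients) {
      socket.write(chunk);
    }
  });

  // Close the client connections once the upstream source is exhausted.
  source.on("end", () => {
    for (const socket of clients) {
      socket.end();
    }
  });

  return server;
}

// Usage (assumption: `res.data` is the axios response stream from the question):
//   createBroadcastServer(res.data).listen(8080, "127.0.0.1");
```

With this arrangement the Node process should be started before Spark, since Spark is the side that opens the connection.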

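This way, the Spark application connects to the NodeJS server and waits for it to send data. In other words, the Spark application pulls data from the streaming source (simulated here by the NodeJS server). I did not try running your Scala code, but since you verified it with netcat (nc -l listens as a server, which is why that case worked), the problem is not in that code. I used the following Python code, a simple word count application, to verify that this works: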

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two worker threads and a batch interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Connect to the server on the given host and port and read lines from it
lines = ssc.socketTextStream("localhost", 8080)

# Split lines into words (using a flatMap, one to many)
words = lines.flatMap(lambda line: line.split(" "))

# Create count value pair
pairs = words.map(lambda word: (word, 1))

# Sum the counts for each word
wordCounts = pairs.reduceByKey(lambda x, y: x+y)

# Print results
wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate
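
One detail worth keeping in mind when feeding `socketTextStream`: it interprets the incoming bytes as UTF-8 text split on newlines, so each record the server writes should end with `\n`. Below is a minimal sketch, assuming the API delivers records that can be serialized as JSON (the question does not say which format the API uses). `toLine` is a hypothetical helper name.

```javascript
// Hypothetical helper: encode one record as a single newline-terminated line.
// JSON.stringify without an indent argument escapes embedded newlines,
// so the only raw "\n" in the result is the terminator added here.
function toLine(record) {
  return JSON.stringify(record) + "\n";
}

// toLine({ id: 1 }) returns '{"id":1}\n'
```

On the server side this could be used as `socket.write(toLine(record))`, so that every write reaches Spark as exactly one line.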
