Print Pyspark Streaming Data from a kafka Topic
I am new to Kafka and PySpark and am trying to write a simple program. I have two files in JSON format in a Kafka topic, and I am reading them with PySpark streaming.
My producer code is as follows:
from kafka import *
import json
import time
import boto3
from Consumer_Group import *
from json import loads

class producer:
    def json_serializer(data):
        return json.dumps(data).encode("utf-8")

    def read_s3():
        p1 = KafkaProducer(bootstrap_servers=['localhost:9092'], value_serializer=producer.json_serializer)
        s3 = boto3.resource('s3')
        bucket = s3.Bucket('kakfa')
        for obj in bucket.objects.all():
            key = obj.key
            body = obj.get()['Body'].read().decode('utf-8')
            p1.send("Uber_Eats", body)
            p1.flush()
My consumer code is as follows:
from pyspark.sql import SparkSession
from kafka import *
import time

class consumer:
    def read_from_topic(self, spark):
        df = spark.readStream \
            .format("kafka") \
            .option("kafka.bootstrap.servers", "localhost:9092") \
            .option("subscribe", "Uber_Eats") \
            .option("startingOffsets", "earliest") \
            .load()
        df.createOrReplaceTempView("kafka")
        spark.sql("select * from kafka")
        print(df.isStreaming)  # isStreaming is a property, not a method

    def get_consumer(self):
        consumer = KafkaConsumer("Uber_Eats", group_id='group1',
                                 bootstrap_servers="localhost:9092")
        return consumer

    def print_details(self, c1):
        # self.consumer = self.get_consumer(self)
        # Read and print messages from the consumer
        try:
            for msg in c1:
                print(msg.topic, msg.value)
            print("Done")
        except Exception as e:
            print(e)
Main class:
from Producer_Group import *
from Consumer_Group import *
from Spark_Connection import *

class client:
    def transfer(self):
        spark = connection.get_connection(self)
        producer.read_s3()
        c1 = consumer.get_consumer(spark)
        consumer.read_from_topic(self, spark)
        # consumer.print_details(self, c1)

c = client()
c.transfer()
Sample data in S3 that I am reading into the Kafka topic:
{
{
"Customer Number": "1",
"Customer Name": "Aditya",
"Restaurant Number": "2201",
"Restaurant NameOrdered": "Bawarchi",
"Number of Items": "3",
"price": "10",
"Operating Start hours": "9:00",
"Operating End hours": "23:00"
},
{
"Customer Number": "2",
"Customer Name": "Sarva",
"Restaurant Number": "2202",
"Restaurant NameOrdered": "Sarvana Bhavan",
"Number of Items": "4",
"price": "20",
"Operating Start hours": "8:00",
"Operating End hours": "20:00"
},
{
"Customer Number": "3",
"Customer Name": "Kala",
"Restaurant Number": "2203",
"Restaurant NameOrdered": "Taco Bell",
"Number of Items": "5",
"price": "30",
"Operating Start hours": "11:00",
"Operating End hours": "21:00"
}
}
What I have tried so far: I tried printing to the console to check a condition, and if it passes, to insert the data into a database. To check the condition, I read the data in the "read_from_topic" function and create a view (createOrReplaceTempView) to look at it, but nothing is printed. Can someone guide me on how to print the data and verify whether my condition and data are being read correctly?
Thanks in advance!!!!
creating a view (createOrReplaceTempView) to see the data, but nothing is printed

That is because spark.sql returns a new DataFrame. If you want to print it, you need

spark.sql("select * from kafka").show()

However, that alone will only give you (at least) two byte-array columns, not JSON strings, so at some point you will need to define a schema to extract anything, or CAST the value to get at least human-readable data.
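As a rough sketch of what that could look like (not part of the original answer: the schema fields are guessed from the sample records above, and the console sink is just one way to inspect the stream):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("UberEatsReader").getOrCreate()

# Schema guessed from the sample records shown above
schema = StructType([
    StructField("Customer Number", StringType()),
    StructField("Customer Name", StringType()),
    StructField("Restaurant Number", StringType()),
    StructField("Restaurant NameOrdered", StringType()),
    StructField("Number of Items", StringType()),
    StructField("price", StringType()),
    StructField("Operating Start hours", StringType()),
    StructField("Operating End hours", StringType()),
])

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "Uber_Eats")
      .option("startingOffsets", "earliest")
      .load()
      # Kafka delivers binary key/value columns; CAST them to strings first
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      # Then parse the JSON string into typed columns
      .select(from_json(col("value"), schema).alias("data"))
      .select("data.*"))

# Write the stream to the console sink so each incoming micro-batch is printed
query = (df.writeStream
         .format("console")
         .option("truncate", "false")
         .start())
query.awaitTermination()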
It is also worth pointing out that the data you have shown is not valid JSON, and that boto3 is not required, since Spark can read files from S3 itself (and therefore Kafka is not strictly needed either, as you could bring the S3 data directly to its final location, with a Spark persist() in between).
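As a rough illustration of that last point (this is only a sketch: it assumes the files contain valid JSON, that the S3A connector and AWS credentials are configured, and the target path is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("S3DirectRead").getOrCreate()

# Read the JSON files straight from S3; no boto3 producer or Kafka topic needed
orders = spark.read.json("s3a://kakfa/")   # bucket name taken from the producer code
orders.show(truncate=False)

# persist() keeps the data cached while you write it to its final destination
orders.persist()
orders.write.mode("append").parquet("s3a://my-target-bucket/uber_eats/")  # hypothetical target path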