
Print Pyspark Streaming Data from a kafka Topic

I am new to kafka and pyspark and am trying to write a simple program. I have 2 files in a kafka topic, in JSON format, and I am reading these files from PySpark Streaming.

My producer code is as follows:

from kafka import KafkaProducer
import json
import boto3


class producer:

    @staticmethod
    def json_serializer(data):
        # Serialize the payload as UTF-8 encoded JSON bytes
        return json.dumps(data).encode("utf-8")

    @staticmethod
    def read_s3():
        p1 = KafkaProducer(bootstrap_servers=['localhost:9092'],
                           value_serializer=producer.json_serializer)
        s3 = boto3.resource('s3')
        bucket = s3.Bucket('kakfa')
        # Send each object in the bucket to the topic
        for obj in bucket.objects.all():
            # Read the object body as a string
            body = obj.get()['Body'].read().decode('utf-8')
            p1.send("Uber_Eats", body)
        p1.flush()

My consumer code is as follows:

from kafka import KafkaConsumer


class consumer:

    def read_from_topic(self, spark):
        # Build a streaming DataFrame over the Kafka topic
        df = spark.readStream \
            .format("kafka") \
            .option("kafka.bootstrap.servers", "localhost:9092") \
            .option("subscribe", "Uber_Eats") \
            .option("startingOffsets", "earliest") \
            .load()
        df.createOrReplaceTempView("kafka")
        spark.sql("select * from kafka")
        # isStreaming is a property, not a method
        print(df.isStreaming)

    def get_consumer(self):
        consumer = KafkaConsumer("Uber_Eats", group_id='group1',
                                 bootstrap_servers="localhost:9092")
        return consumer

    def print_details(self, c1):
        # Read and print messages from the consumer
        try:
            for msg in c1:
                print(msg.topic, msg.value)
            print("Done")
        except Exception as e:
            print(e)

Main Class:

from Producer_Group import *
from Consumer_Group import *
from Spark_Connection import *


class client:
    def transfer(self):
        spark = connection.get_connection(self)
        producer.read_s3()
        c1 = consumer.get_consumer(spark)
        consumer.read_from_topic(self, spark)
        # consumer.print_details(self, c1)


c = client()
c.transfer()

Sample data in S3 that I am reading into the kafka topic:

{
    
        {
            "Customer Number": "1",
            "Customer Name": "Aditya",
            "Restaurant Number": "2201",
            "Restaurant NameOrdered": "Bawarchi",
            "Number of Items": "3",
            "price": "10",
            "Operating Start hours": "9:00",
            "Operating End hours": "23:00"
        },
        {
            "Customer Number": "2",
            "Customer Name": "Sarva",
            "Restaurant Number": "2202",
            "Restaurant NameOrdered": "Sarvana Bhavan",
            "Number of Items": "4",
            "price": "20",
            "Operating Start hours": "8:00",
            "Operating End hours": "20:00"
        },
        {
            "Customer Number": "3",
            "Customer Name": "Kala",
            "Restaurant Number": "2203",
            "Restaurant NameOrdered": "Taco Bell",
            "Number of Items": "5",
            "price": "30",
            "Operating Start hours": "11:00",
            "Operating End hours": "21:00"
        }
    
}

What I have tried so far: I tried printing to the console to check a condition, and if it passes, to insert the data into a database. To check the condition, I am reading the data from the "read_from_topic" function and creating a view (createOrReplaceTempView) to see the data, but nothing is printed. Can someone guide me on how to print it and verify whether my condition/data is being read correctly?

Thanks in advance!!!!

"creating a view (createOrReplaceTempView) to see the data, but nothing is printed"

That's because spark.sql returns a new DataFrame.

If you want to print it, then you need:

spark.sql("select * from kafka").show()

However, that alone will only give you (at least) two byte-array columns rather than JSON strings, so at some point you will need to define a schema to extract anything, or CAST the columns to at least get human-readable data.
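
For illustration, here is a minimal sketch of both steps. The schema is an assumption (field names copied from the sample data above, trimmed to three fields for brevity), and df is the streaming DataFrame built in read_from_topic. Note that because df is a streaming DataFrame, printing ultimately has to go through writeStream with a console sink rather than a plain .show():

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema: field names taken from the sample data above,
# trimmed to three fields for brevity
schema = StructType([
    StructField("Customer Number", StringType()),
    StructField("Customer Name", StringType()),
    StructField("price", StringType()),
])

# Cast the raw Kafka bytes to a JSON string, then parse it with the schema
parsed = (df
          .selectExpr("CAST(value AS STRING)")
          .select(from_json(col("value"), schema).alias("order"))
          .select("order.*"))

# The console sink prints each micro-batch to stdout
query = (parsed.writeStream
         .format("console")
         .option("truncate", False)
         .start())
query.awaitTermination()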

It's also worth pointing out that the data you've shown is not valid JSON, and that boto3 is not necessary because Spark can read files from S3 by itself (so Kafka isn't strictly needed either, since you could bring the S3 data directly into your final destination, with a Spark persist() function in between).
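
For example, a minimal sketch of that direct route (assuming the 'kakfa' bucket name from your producer code, objects that contain valid JSON, and an S3 connector such as hadoop-aws already configured with credentials):

# Spark reads the bucket itself; neither boto3 nor Kafka is involved
orders = spark.read.json("s3a://kakfa/")
orders.show()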
