从 kafka 主题打印 Pyspark 流数据

Question

I am new to kafka and pyspark and trying to write simple program, SO I have 2 files in kafka Topics in JSon format and I am reading this from pyspark streaming.我是 kafka 和 pyspark 的新手，并尝试编写简单的程序，所以我在 kafka 主题中有 2 个文件，格式为 JSon，我正在从 Z77BB59DCD89559748E5DB56956C1006 流式传输中读取此文件。

My Producer code is as follows:我的生产者代码如下：

  from kafka import *
import json
import time
import boto3
import json
from Consumer_Group import *
from json import loads
class producer :
            def json_serializer(data):
                    return json.dumps(data).encode("utf-8")

            def read_s3():
                p1 = KafkaProducer(bootstrap_servers=['localhost:9092'], value_serializer=producer.json_serializer)
                s3 = boto3.resource('s3')
                bucket = s3.Bucket('kakfa')
                for obj in bucket.objects.all():
                    key = obj.key
                    body = obj.get()['Body'].read().decode('utf-8')
                p1.send("Uber_Eats",body)
                p1.flush()

My Consumer code is as follows:我的消费者代码如下：

from pyspark.sql import SparkSession
from kafka import *
import time
class consumer:
                def read_from_topic(self,spark):
                        df = spark.readStream \
                            .format("kafka") \
                            .option("kafka.bootstrap.servers", "localhost:9092") \
                            .option("subscribe", "Uber_Eats") \
                             .option("startingOffsets", "earliest") \
                            .load()
                        df.createOrReplaceTempView("kafka")
                        spark.sql("select * from kafka")
                        print(df.isStreaming())
                                  


                def get_consumer(self):
                    consumer = KafkaConsumer("Uber_Eats", group_id='group1', bootstrap_servers=
                    "localhost:9092")
                    return  consumer

                def print_details(self,c1):
                    #    self.consumer=self.get_consumer(self)
                        # Read and print message from consumer
                     try:
                                for msg in c1:
                                    print(msg.topic, msg.value)
                                print("Done")
                     except Exception  as e:
                                print(e)

Main Class:主要 Class：

from Producer_Group import *
from Consumer_Group import *
from Spark_Connection import *
class client:
    def transfer(self):
        spark = connection.get_connection(self)
        producer.read_s3()
        c1 = consumer.get_consumer(spark)
        consumer.read_from_topic(self,spark)
      #  consumer.print_details(self,c1)

c=client()
c.transfer()

Sample Data in S3 that i am reading into kafka Topic:我正在读入 kafka 主题的 S3 中的示例数据：

{
    
        {
            "Customer Number": "1",
            "Customer Name": "Aditya",
            "Restaurant Number": "2201",
            "Restaurant NameOrdered": "Bawarchi",
            "Number of Items": "3",
            "price": "10",
            "Operating Start hours": "9:00",
            "Operating End hours": "23:00"
        },
        {
            "Customer Number": "2",
            "Customer Name": "Sarva",
            "Restaurant Number": "2202",
            "Restaurant NameOrdered": "Sarvana Bhavan",
            "Number of Items": "4",
            "price": "20",
            "Operating Start hours": "8:00",
            "Operating End hours": "20:00"
        },
        {
            "Customer Number": "3",
            "Customer Name": "Kala",
            "Restaurant Number": "2203",
            "Restaurant NameOrdered": "Taco Bell",
            "Number of Items": "5",
            "price": "30",
            "Operating Start hours": "11:00",
            "Operating End hours": "21:00"
        }
    
}

What Have I tried so far: : I have tried to print on console so as to check for condition and if it passes only then insert it into databse.到目前为止我尝试了什么：：我尝试在控制台上打印以检查条件，如果它通过了，则将其插入数据库。 to check the condtiion, I am reading data from "read_from_topic" function and creating a view (createOrReplaceTempView) to see data, but nothing is printing, can some one please guide me how to print and verify if My conditions or data is read correclty?为了检查条件，我正在从“read_from_topic”function 中读取数据并创建一个视图（createOrReplaceTempView）来查看数据，但没有打印，有人可以指导我如何打印并验证我的条件或数据是否被正确读取？

Thanks in Advance !!!!提前致谢！！！！

Answer 1

creating a view (createOrReplaceTempView) to see data, but nothing is printing创建视图 (createOrReplaceTempView) 以查看数据，但没有打印任何内容

Because spark.sql returns a new Dataframe.因为spark.sql返回一个新的 Dataframe。

If you want to print it, then you'll need如果你想打印它，那么你需要

spark.sql("select * from kafka").show()

However, this alone will be at least two byte array columns, not JSON strings, so you'll want to define a schema at some point to extract anything or CAST to at least have human readable data但是，仅此一项就至少是两个字节数组列，而不是 JSON 字符串，因此您需要在某些时候定义一个模式以提取任何内容或CAST以至少具有人类可读的数据

Also worth pointing out that the data you've shown is not valid JSON, and boto3 isn't necessary since Spark can read files from S3 itself (and thus Kafka isn't strictly needed since you could just take S3 data directly into your final location, with a Spark persist() function in between)还值得指出的是，您显示的数据不是有效的 JSON，并且boto3不是必需的，因为 Spark 可以从 S3 本身读取文件（因此并不严格需要 Kafka，因为您可以直接将 S3 数据带入您的最终位置，中间有一个 Spark persist() function)

从 kafka 主题打印 Pyspark 流数据

问题描述

1 个解决方案

解决方案1
1 2021-05-08 23:15:01

从 kafka 主题打印 Pyspark 流数据

问题描述

1 个解决方案

解决方案1 1 2021-05-08 23:15:01

解决方案1
1 2021-05-08 23:15:01