
Print PySpark Streaming Data from a Kafka Topic

I am new to Kafka and PySpark and am trying to write a simple program. I have 2 files in JSON format in a Kafka topic, and I am reading them with PySpark streaming.

My producer code is as follows:

from kafka import KafkaProducer
import json
import boto3


class producer:
    @staticmethod
    def json_serializer(data):
        # Serialize a Python object to UTF-8 encoded JSON bytes
        return json.dumps(data).encode("utf-8")

    @staticmethod
    def read_s3():
        p1 = KafkaProducer(bootstrap_servers=['localhost:9092'],
                           value_serializer=producer.json_serializer)
        s3 = boto3.resource('s3')
        bucket = s3.Bucket('kakfa')
        for obj in bucket.objects.all():
            body = obj.get()['Body'].read().decode('utf-8')
            # Send the contents of each S3 object as one message
            p1.send("Uber_Eats", body)
        p1.flush()

My consumer code is as follows:

from pyspark.sql import SparkSession
from kafka import KafkaConsumer


class consumer:
    def read_from_topic(self, spark):
        # Subscribe to the topic as a streaming DataFrame
        df = spark.readStream \
            .format("kafka") \
            .option("kafka.bootstrap.servers", "localhost:9092") \
            .option("subscribe", "Uber_Eats") \
            .option("startingOffsets", "earliest") \
            .load()
        df.createOrReplaceTempView("kafka")
        spark.sql("select * from kafka")
        # isStreaming is a property, not a method
        print(df.isStreaming)

    def get_consumer(self):
        consumer = KafkaConsumer("Uber_Eats", group_id='group1',
                                 bootstrap_servers="localhost:9092")
        return consumer

    def print_details(self, c1):
        # self.consumer = self.get_consumer(self)
        # Read and print messages from the consumer
        try:
            for msg in c1:
                print(msg.topic, msg.value)
            print("Done")
        except Exception as e:
            print(e)

Main class:

from Producer_Group import *
from Consumer_Group import *
from Spark_Connection import *
class client:
    def transfer(self):
        spark = connection.get_connection(self)
        producer.read_s3()
        c1 = consumer.get_consumer(spark)
        consumer.read_from_topic(self,spark)
      #  consumer.print_details(self,c1)

c=client()
c.transfer()

Sample data in S3 that I am reading into the Kafka topic:

{
    
        {
            "Customer Number": "1",
            "Customer Name": "Aditya",
            "Restaurant Number": "2201",
            "Restaurant NameOrdered": "Bawarchi",
            "Number of Items": "3",
            "price": "10",
            "Operating Start hours": "9:00",
            "Operating End hours": "23:00"
        },
        {
            "Customer Number": "2",
            "Customer Name": "Sarva",
            "Restaurant Number": "2202",
            "Restaurant NameOrdered": "Sarvana Bhavan",
            "Number of Items": "4",
            "price": "20",
            "Operating Start hours": "8:00",
            "Operating End hours": "20:00"
        },
        {
            "Customer Number": "3",
            "Customer Name": "Kala",
            "Restaurant Number": "2203",
            "Restaurant NameOrdered": "Taco Bell",
            "Number of Items": "5",
            "price": "30",
            "Operating Start hours": "11:00",
            "Operating End hours": "21:00"
        }
    
}

What I have tried so far: I want to print the data on the console so that I can check a condition, and insert the data into a database only if the condition passes. To check the condition, I read the data in the read_from_topic function and create a view (createOrReplaceTempView) to see the data, but nothing is printed. Can someone please guide me on how to print the data and verify that my conditions are applied and the data is read correctly?

Thanks in advance!

"creating a view (createOrReplaceTempView) to see data, but nothing is printing"

Because spark.sql returns a new DataFrame; it does not print anything by itself.

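If you want to print it, then for a batch DataFrame (spark.read) you'll need

spark.sql("select * from kafka").show()

A streaming DataFrame (spark.readStream), as in the question, does not support show(). There the query has to be started with writeStream, for example with the console sink.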
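Printing a streaming DataFrame through the console sink can be sketched as follows. This is only a sketch: it assumes the topic and broker address from the question, a SparkSession started with the spark-sql-kafka package, and a hypothetical helper name, start_console_query.

```python
def start_console_query(spark, topic="Uber_Eats", servers="localhost:9092"):
    """Start a streaming query that prints each micro-batch to the console."""
    df = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", servers)
        .option("subscribe", topic)
        .option("startingOffsets", "earliest")
        .load()
    )
    # key and value arrive as binary; cast them so the console shows text
    readable = df.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
    )
    return (
        readable.writeStream
        .format("console")
        .option("truncate", "false")
        .outputMode("append")
        .start()
    )

# Usage (needs a running broker and a real SparkSession named spark):
#   query = start_console_query(spark)
#   query.awaitTermination()
```

Without awaitTermination() the driver script may exit before the first micro-batch is printed.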

However, this alone will return at least two byte-array columns (key and value), not JSON strings, so you'll want to define a schema at some point to extract fields, or at least CAST the columns to strings to get human-readable data.
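The schema step can be sketched with Spark SQL expressions. This assumes each Kafka message holds a single JSON object shaped like one record of the sample data, which may not match what the producer in the question actually sends. The names parse_orders and ORDER_SCHEMA are hypothetical, and the schema lists only two of the sample fields.

```python
# Hypothetical DDL schema covering two of the sample fields; extend as needed.
ORDER_SCHEMA = "`Customer Name` STRING, price STRING"


def parse_orders(df, schema=ORDER_SCHEMA):
    """Turn the binary Kafka value column into one column per schema field."""
    # Step 1: bytes -> string
    as_text = df.selectExpr("CAST(value AS STRING) AS json")
    # Step 2: string -> struct, using the SQL from_json function
    parsed = as_text.selectExpr(f"from_json(json, '{schema}') AS data")
    # Step 3: flatten the struct into top-level columns
    return parsed.select("data.*")

# Usage (df is the DataFrame returned by spark.readStream...load()):
#   orders = parse_orders(df)
#   orders.where("CAST(price AS INT) >= 20")   # example condition, an assumption
```

Records that do not match the schema come back as nulls rather than raising an error, so it is worth checking the console output before writing to a database.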

It is also worth pointing out that the data you've shown is not valid JSON. In addition, boto3 isn't necessary, since Spark can read files from S3 itself. Kafka therefore isn't strictly needed either: you could take the S3 data directly into your final location, with a Spark persist() function in between.
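The validity point can be checked quickly with Python's json module. The strings below are shortened stand-ins for the sample data, and is_valid_json is a hypothetical helper; wrapping the records in an array is one possible fix, not necessarily the layout the asker intended.

```python
import json

# Shape used in the question: objects placed directly inside braces
# without keys, which the JSON grammar does not allow.
invalid = '{ {"Customer Number": "1"}, {"Customer Number": "2"} }'

# One valid alternative: wrap the records in an array.
valid = '[ {"Customer Number": "1"}, {"Customer Number": "2"} ]'


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(invalid))  # False
print(is_valid_json(valid))    # True
```

Spark's own JSON reader expects one object per line by default, so a JSON Lines file may be the simpler target if the data is later read straight from S3.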
