Problems receiving messages from pubsublite with spark streaming

I have a problem: I am trying to receive messages from pubsublite in real time on a Spark cluster on GCP, but they arrive grouped in blocks of one minute.

My code:

producer.py

import random
import time
from proj_BOLSA import settings
from google.cloud.pubsublite.cloudpubsub import PublisherClient
from google.cloud.pubsublite.types import (
    CloudRegion,
    CloudZone,
    MessageMetadata,
    TopicPath,
)
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/vm-sergiolr-development/Desktop/projecteemo_code/proj_BOLSA/credentials/gcp_authentication.json"

regional = True

if regional:
    location = CloudRegion(settings.REGION)
else:
    location = CloudZone(CloudRegion(settings.REGION), settings.ZONE)

topic_path = TopicPath(settings.PROJECT_NUMBER, location, settings.TOPIC)

# PublisherClient() must be used in a `with` block or have __enter__() called before use.
with PublisherClient() as publisher_client:
    for i in range(6000):
        data = "number: "+str(random.randint(0, 300))
        api_future = publisher_client.publish(topic_path, data.encode("utf-8"))
        # result() blocks. To resolve API futures asynchronously, use add_done_callback().
        message_id = api_future.result()
        message_metadata = MessageMetadata.decode(message_id)
        print(
            f"Published {data} to {topic_path} with partition {message_metadata.partition.value} and offset {message_metadata.cursor.offset}."
        )
        time.sleep(20)

consumer.py

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StringType, StructType, StructField, ArrayType
from pyspark.sql.functions import from_json, col
# TODO(developer):
project_number = xxxxxxxx
location = "europe-west1"
subscription_id = "s_producer"


spark = SparkSession.builder.appName("read-app").master("yarn").getOrCreate()

sdf = (
    spark.readStream.format("pubsublite")
    .option(
        "pubsublite.subscription",
        f"projects/{project_number}/locations/{location}/subscriptions/{subscription_id}",
    )
    .option("rowsPerSecond", 1).load()
)


sdf = sdf.withColumn("data", sdf.data.cast(StringType()))

query = (
    sdf.writeStream.format("console")
    .outputMode("append")
    .trigger(processingTime="1 second")
    .option("truncate", False)
    .start()
)

# Wait 120 seconds (must be >= 60 seconds) to start receiving messages.
query.awaitTermination(120)
query.stop()

results

-------------------------------------------
Batch: 1
-------------------------------------------
+---------------------------------------------------------------------+---------+------+---+-----------+--------------------------+---------------+----------+
|subscription                                                         |partition|offset|key|data       |publish_timestamp         |event_timestamp|attributes|
+---------------------------------------------------------------------+---------+------+---+-----------+--------------------------+---------------+----------+
|projects/658599344059/locations/europe-west1/subscriptions/s_producer|0        |5942  |[] |number: 74 |2022-08-05 08:42:47.796738|null           |{}        |
|projects/658599344059/locations/europe-west1/subscriptions/s_producer|0        |5943  |[] |number: 288|2022-08-05 08:43:07.849063|null           |{}        |
|projects/658599344059/locations/europe-west1/subscriptions/s_producer|0        |5944  |[] |number: 156|2022-08-05 08:43:27.952513|null           |{}        |
+---------------------------------------------------------------------+---------+------+---+-----------+--------------------------+---------------+----------+

-------------------------------------------
Batch: 2
-------------------------------------------
+---------------------------------------------------------------------+---------+------+---+-----------+--------------------------+---------------+----------+
|subscription                                                         |partition|offset|key|data       |publish_timestamp         |event_timestamp|attributes|
+---------------------------------------------------------------------+---------+------+---+-----------+--------------------------+---------------+----------+
|projects/658599344059/locations/europe-west1/subscriptions/s_producer|0        |5945  |[] |number: 162|2022-08-05 08:43:48.00867 |null           |{}        |
|projects/658599344059/locations/europe-west1/subscriptions/s_producer|0        |5946  |[] |number: 262|2022-08-05 08:44:08.062032|null           |{}        |
|projects/658599344059/locations/europe-west1/subscriptions/s_producer|0        |5947  |[] |number: 59 |2022-08-05 08:44:28.11492 |null           |{}        |
+---------------------------------------------------------------------+---------+------+---+-----------+--------------------------+---------------+----------+

-------------------------------------------
Batch: 3
-------------------------------------------
+---------------------------------------------------------------------+---------+------+---+-----------+--------------------------+---------------+----------+
|subscription                                                         |partition|offset|key|data       |publish_timestamp         |event_timestamp|attributes|
+---------------------------------------------------------------------+---------+------+---+-----------+--------------------------+---------------+----------+
|projects/658599344059/locations/europe-west1/subscriptions/s_producer|0        |5948  |[] |number: 54 |2022-08-05 08:44:48.168997|null           |{}        |
|projects/658599344059/locations/europe-west1/subscriptions/s_producer|0        |5949  |[] |number: 206|2022-08-05 08:45:08.225344|null           |{}        |
|projects/658599344059/locations/europe-west1/subscriptions/s_producer|0        |5950  |[] |number: 109|2022-08-05 08:45:28.328074|null           |{}        |
+---------------------------------------------------------------------+---------+------+---+-----------+--------------------------+---------------+----------+

What mistake am I making that prevents me from reading the messages as they arrive, instead of having them grouped into batches of about one minute?

Thank you!

You are using the micro-batch streaming mode, where the Spark runtime decides how many messages to read from the source at a time. It is actually reading ~30-second windows of data, not 1-minute windows.
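
If you want to confirm how often a micro-batch fires and how many rows land in each one, you can attach a streaming query listener before starting the query. This is a minimal sketch using only standard Spark APIs (PySpark exposes StreamingQueryListener from Spark 3.4 onwards); nothing in it is specific to the pubsublite connector:

from pyspark.sql.streaming import StreamingQueryListener

class BatchTimingListener(StreamingQueryListener):
    # Called once when the streaming query starts.
    def onQueryStarted(self, event):
        print(f"Query started: {event.id}")

    # Called after every micro-batch; the progress object reports the
    # batch id, input row count and trigger timestamp.
    def onQueryProgress(self, event):
        p = event.progress
        print(f"Batch {p.batchId}: {p.numInputRows} rows, trigger at {p.timestamp}")

    # Called when the query stops or fails.
    def onQueryTerminated(self, event):
        print(f"Query terminated: {event.id}")

# Register the listener on the same SparkSession before calling query.start().
spark.streams.addListener(BatchTimingListener())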

To read smaller time windows for small amounts of data, you would need to use the experimental continuous processing mode: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing
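
As a minimal sketch, the change for the consumer above would be to swap the processing-time trigger for a continuous one. The trigger syntax below is standard Spark; whether the pubsublite source actually supports continuous triggers is an assumption you would need to verify for your connector version:

# Same console sink as before, but with a continuous trigger instead of
# processingTime. The interval is a checkpoint interval, not a batch
# interval: rows are processed as they arrive rather than being grouped
# into micro-batches.
query = (
    sdf.writeStream.format("console")
    .outputMode("append")
    .trigger(continuous="1 second")
    .option("truncate", False)
    .start()
)

query.awaitTermination()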
