How to Compile a While Loop statement in PySpark on Apache Spark with Databricks
I'm trying to send data to my Data Lake with a While Loop.
Basically, the intention is to loop continually and send data to my Data Lake whenever data is received from my Azure Service Bus, using the following code:
This code receives a message from my Service Bus:
import json

from azure.servicebus import ServiceBusClient

def myfunc():
    with ServiceBusClient.from_connection_string(CONNECTION_STR) as client:
        # max_wait_time specifies how long the receiver should wait with no
        # incoming messages before stopping receipt.
        # Default is None, i.e. receive forever.
        with client.get_queue_receiver(QUEUE_NAME, session_id=session_id, max_wait_time=5) as receiver:
            for msg in receiver:
                # print("Received: " + str(msg))
                themsg = json.loads(str(msg))
                # complete the message so that it is removed from the queue
                receiver.complete_message(msg)
                return themsg
This code assigns the message to a variable:
result = myfunc()
The following code sends the message to my data lake:
rdd = sc.parallelize([json.dumps(result)])
spark.read.json(rdd) \
    .write.mode("overwrite").json('/mnt/lake/RAW/FormulaClassification/F1Area/')
I would like help looping through the code so that it continually checks for messages and sends the results to my data lake.
I believe the solution involves a While Loop, but I'm not sure.
Just because you're using Spark doesn't mean you cannot loop.
First of all, you're only returning the first message from your receiver, so it should look like this:
with client.get_queue_receiver(QUEUE_NAME, session_id=session_id, max_wait_time=5) as receiver:
    msg = next(iter(receiver))
    # print("Received: " + str(msg))
    themsg = json.loads(str(msg))
    # complete the message so that it is removed from the queue
    # (pass the original message object, not its string form)
    receiver.complete_message(msg)
    return themsg
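One caveat worth noting (my addition, not part of the original answer): `next()` on the receiver raises `StopIteration` when `max_wait_time` expires with no message, which would crash a polling loop. A defensive sketch using `next(..., None)`, with plain lists standing in for the real Service Bus receiver:

```python
import json

def receive_one(receiver):
    """Return the parsed body of the next message, or None if the receiver
    yields nothing (stand-in sketch; the real object is an azure.servicebus
    queue receiver, where you would also call receiver.complete_message)."""
    msg = next(iter(receiver), None)  # default instead of StopIteration
    if msg is None:
        return None
    return json.loads(str(msg))

# Stand-in "receivers": a queue holding one JSON message, and an empty one.
print(receive_one(['{"driver": "HAM", "points": 25}']))  # {'driver': 'HAM', 'points': 25}
print(receive_one([]))                                   # None
```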
To answer your question:
while True:
    result = json.dumps(myfunc())
    rdd = sc.parallelize([result])
    # You could use rdd.toDF() here instead
    spark.read.json(rdd) \
        .write.mode("overwrite").json('/mnt/lake/RAW/FormulaClassification/F1Area/')
Keep in mind that the output file names aren't consistent, and you might not want them to be overwritten.
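One way around the overwrite problem (my sketch, not from the original answer) is to write each batch to its own timestamped subdirectory with `mode("append")`. The path-building part is plain Python:

```python
from datetime import datetime, timezone

BASE_PATH = '/mnt/lake/RAW/FormulaClassification/F1Area'  # path from the question

def batch_output_path(base=BASE_PATH, now=None):
    """Build a unique, timestamped output directory for one batch."""
    now = now or datetime.now(timezone.utc)
    return f"{base}/batch={now:%Y%m%d%H%M%S%f}"

# Each batch then lands in its own directory, e.g.:
#   spark.read.json(rdd).write.mode("append").json(batch_output_path())
path = batch_output_path(now=datetime(2022, 3, 1, 12, 0, 0, tzinfo=timezone.utc))
print(path)  # /mnt/lake/RAW/FormulaClassification/F1Area/batch=20220301120000000000
```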
Alternatively, you should look into writing your own Source / SparkDataStream class that defines SparkSQL sources, so that you don't need a loop in your main method and the polling is handled natively by Spark.