Google Cloud PubSub: Not sending/receiving all messages from Cloud Functions

Question

Summary: My client code triggers 861 background Google Cloud Functions by publishing messages to a Pub/Sub topic. Each Cloud Function performs a task, uploads its results to Google Storage, and publishes a message to another Pub/Sub topic that the client code is listening on. The client code does not receive all the messages, although all Cloud Functions executed (verified by the number of results in Google Storage).

Server side: I have a background Google Cloud Function that is triggered each time a message is published to a TRIGGER Pub/Sub topic. The custom attributes of the message act as function parameters that determine which task the function performs. The function then uploads the result to a bucket in Google Storage and publishes a message (with the taskID and execution timing details) to a RESULTS Pub/Sub topic (different from the one used to trigger the function).

Client side: I need to perform 861 different tasks, which requires calling the Cloud Function with 861 slightly different inputs. These tasks are similar, and the Cloud Function takes between 20 seconds and 2 minutes (median about 1 minute) to execute each one. I have written a Python script for this that I run from the Google Cloud Shell (or a local machine shell). The client script publishes 861 messages to the TRIGGER Pub/Sub topic, which triggers as many Cloud Functions concurrently, each of which is passed a unique taskID in the range [0, 860]. The client script then polls a subscription to the RESULTS Pub/Sub topic using synchronous pull. After performing its task, each Cloud Function publishes a message to the RESULTS Pub/Sub topic with its unique taskID and timing details. The client uses this taskID to identify which task a message came from and to detect duplicate messages, which are discarded.

Basic steps:

1. The client Python script publishes 861 messages (each with a unique taskID) to the TRIGGER Pub/Sub topic and waits for result messages from the Cloud Function.
2. 861 Cloud Function invocations run, each of which performs a task, uploads results to Google Storage, and publishes a message (with taskID and execution timing details) to the RESULTS Pub/Sub topic.
3. The client pulls all the messages synchronously and marks each task as complete.

Problem: When the client polls for messages from the RESULTS Pub/Sub topic, it does not receive messages for all taskIDs. I am sure that every Cloud Function was called and executed properly (I have 861 results in the Google Storage bucket). I repeated this a number of times and it occurred every time. Strangely, the number of missing taskIDs changes every time, and different taskIDs go missing across runs. I also keep track of the number of duplicate taskIDs received. The table below gives the number of unique taskIDs received, missing, and repeated for 5 independent runs.

Run  # of Tasks  Received  Missing  Repeated
1    861         860       1        25
2    861         840       21       3
3    861         851       10       1
4    861         837       24       3
5    861         856       5        1
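For readers who want to reproduce these columns, here is a minimal sketch of how the three counts could be derived from the taskIDs seen by the client. The helper name summarize is hypothetical and not part of client.py, and it assumes that "Repeated" means the number of extra deliveries beyond the first for any taskID, which the question does not state explicitly.

```python
from collections import Counter


def summarize(received_task_ids, num_tasks):
    """Return (received, missing, repeated) for one run.

    received_task_ids: every taskID pulled from the subscription,
    duplicates included. num_tasks: how many tasks were published,
    with taskIDs assumed to be 0 .. num_tasks - 1 as in client.py.
    """
    counts = Counter(received_task_ids)
    expected = set(range(num_tasks))
    seen = set(counts)

    received = len(expected & seen)        # unique taskIDs that arrived
    missing = sorted(expected - seen)      # taskIDs that never arrived
    # Assumption: count each delivery beyond the first as one repeat.
    repeated = sum(n - 1 for n in counts.values() if n > 1)
    return received, missing, repeated


if __name__ == "__main__":
    # taskID 1 arrives twice, taskIDs 2 and 4 never arrive.
    print(summarize([0, 1, 1, 3], num_tasks=5))
```

Returning the list of missing taskIDs rather than only their count makes it easier to compare which tasks go missing across runs.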

I am not sure where this problem arises. Given the random nature of both the number and the identity of the missing taskIDs, I suspect there is a bug in the Pub/Sub at-least-once delivery logic. If, in the Cloud Function, I sleep for a few seconds instead of performing the task, for example with time.sleep(5), then everything works fine (the client receives all 861 taskIDs).

Code to reproduce this problem:

In the following, main.py and requirements.txt are deployed as the Google Cloud Function, while client.py is the client code. Run the client with 100 concurrent tasks as python client.py 100, which repeats the experiment 5 times. A different number of taskIDs goes missing each time.

requirements.txt

google-cloud-pubsub

main.py

"""
This file is deployed as a Google Cloud Function. This function starts,
sleeps for some seconds and publishes back the taskID.

Deployment:
    gcloud functions deploy gcf_run --runtime python37 --trigger-topic <TRIGGER_TOPIC> --memory=128MB --timeout=300s
"""

import time
from random import randint
from google.cloud import pubsub_v1

# Global variables
project_id = "<Your Google Cloud Project ID>"  # Your Google Cloud Project ID
topic_name = "<RESULTS_TOPIC>"  # Your Pub/Sub topic name


def gcf_run(data, context):
    """Background Cloud Function to be triggered by Pub/Sub.
    Args:
         data (dict): The dictionary with data specific to this type of event.
         context (google.cloud.functions.Context): The Cloud Functions event
         metadata.
    """

    # Message should contain taskID (in addition to the data)
    if 'attributes' in data:
        attributes = data['attributes']
        if 'taskID' in attributes:
            taskID = attributes['taskID']
        else:
            print('taskID missing!')
            return
    else:
        print('attributes missing!')
        return

    # Sleep for a random time between 30 seconds and 1.5 minutes
    print("Start execution for {}".format(taskID))
    sleep_time = randint(30, 90)  # sleep for this many seconds
    time.sleep(sleep_time)  # sleep for few seconds

    # Marks this task complete by publishing a message to Pub/Sub.
    data = u'Message number {}'.format(taskID)
    data = data.encode('utf-8')  # Data must be a bytestring
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_name)
    publisher.publish(topic_path, data=data, taskID=taskID)

    return

client.py

"""
The client code creates the given number of tasks and publishes to Pub/Sub,
which in turn calls the Google Cloud Functions concurrently.
Run:
    python client.py 100
"""

from __future__ import print_function
import sys
import time
from google.cloud import pubsub_v1

# Global variables
project_id = "<Google Cloud Project ID>" # Google Cloud Project ID
topic_name = "<TRIGGER_TOPIC>"    # Pub/Sub topic name to publish
subscription_name = "<subscriber to RESULTS_TOPIC>"  # Pub/Sub subscription name
num_experiments = 5  # number of times to repeat the experiment
time_between_exp = 120.0 # number of seconds between experiments

# Initialize the Publisher (to send commands that invoke Cloud Functions)
# as well as Subscriber (to receive results written by the Cloud Functions)
# Configure the batch to publish as soon as there is one kilobyte
# of data or one second has passed.
batch_settings = pubsub_v1.types.BatchSettings(
    max_bytes=1024,  # One kilobyte
    max_latency=1,   # One second
)
publisher = pubsub_v1.PublisherClient(batch_settings)
topic_path = publisher.topic_path(project_id, topic_name)

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(
    project_id, subscription_name)


class Task:
    """
    A task which will execute the Cloud Function once.

    Attributes:
        taskID (int)       : A unique number given to a task (starting from 0).
        complete (boolean) : Flag to indicate if this task has completed.
    """
    def __init__(self, taskID):
        self.taskID = taskID
        self.complete = False

    def start(self):
        """
        Start the execution of Cloud Function by publishing a message with
        taskID to the Pub/Sub topic.
        """
        data = u'Message number {}'.format(self.taskID)
        data = data.encode('utf-8')  # Data must be a bytestring
        publisher.publish(topic_path, data=data, taskID=str(self.taskID))

    def end(self):
        """
        Mark the end of this task.
            Returns (boolean):
                True if normal, False if task was already marked before.
        """
        # If this task was not complete, mark it as completed
        if not self.complete:
            self.complete = True
            return True

        return False
    # [END of Task Class]


def createTasks(num_tasks):
    """
    Create a list of tasks and return it.
        Args:
            num_tasks (int) : Number of tasks (Cloud Function calls)
        Returns (list):
            A list of tasks.
    """
    all_tasks = list()
    for taskID in range(0, num_tasks):
        all_tasks.append(Task(taskID=taskID))

    return all_tasks


def receiveResults(all_tasks):
    """
    Receives messages from the Pub/Sub subscription. I am using a blocking
    Synchronous Pull instead of the usual asynchronous pull with a callback
    function as I rely on a polling pattern to retrieve messages.
    See: https://cloud.google.com/pubsub/docs/pull
        Args:
            all_tasks (list) : List of all tasks.
    """
    num_tasks = len(all_tasks)
    total_msg_received = 0  # track the number of messages received
    NUM_MESSAGES = 10  # maximum number of messages to pull synchronously
    TIMEOUT = 600.0    # number of seconds to wait for response (10 minutes)

    # Keep track of elapsed time and exit if > TIMEOUT
    __MyFuncStartTime = time.time()
    __MyFuncElapsedTime = 0.0

    print('Listening for messages on {}'.format(subscription_path))
    while (total_msg_received < num_tasks) and (__MyFuncElapsedTime < TIMEOUT):
        # The subscriber pulls a specific number of messages.
        response = subscriber.pull(subscription_path,
            max_messages=NUM_MESSAGES, timeout=TIMEOUT, retry=None)
        ack_ids = []

        # Keep track of all received messages
        for received_message in response.received_messages:
            if received_message.message.attributes:
                attributes = received_message.message.attributes
                taskID = int(attributes['taskID'])
                if all_tasks[taskID].end():
                    # increment count only if task completes the first time
                    # if False, we received a duplicate message
                    total_msg_received += 1
                #     print("Received taskID = {} ({} of {})".format(
                #         taskID, total_msg_received, num_tasks))
                # else:
                #     print('REPEATED: taskID {} was already marked'.format(taskID))
            else:
                print('attributes missing!')

            ack_ids.append(received_message.ack_id)

        # Acknowledges the received messages so they will not be sent again.
        if ack_ids:
            subscriber.acknowledge(subscription_path, ack_ids)

        time.sleep(0.2)  # Wait 200 ms before polling again
        __MyFuncElapsedTime = time.time() - __MyFuncStartTime
        # print("{} s elapsed. Listening again.".format(__MyFuncElapsedTime))

    # if total_msg_received != num_tasks, function exit due to timeout
    if total_msg_received != num_tasks:
        print("WARNING: *** Receiver timed out! ***")
    print("Received {} messages out of {}. Done.".format(
        total_msg_received, num_tasks))


def main(num_tasks):
    """
    Main execution point of the program
    """

    for experiment_num in range(1, num_experiments + 1):
        print("Starting experiment {} of {} with {} tasks".format(
            experiment_num, num_experiments, num_tasks))
        # Create all tasks and start them
        all_tasks = createTasks(num_tasks)
        for task in all_tasks:     # Start all tasks
            task.start()
        print("Published {} taskIDs".format(num_tasks))

        receiveResults(all_tasks)  # Receive message from Pub/Sub subscription

        print("Waiting {} seconds\n\n".format(time_between_exp))
        time.sleep(time_between_exp)  # sleep between experiments


if __name__ == "__main__":
    if(len(sys.argv) != 2):
        print("usage: python client.py  <num_tasks>")
        print("    num_tasks: Number of concurrent Cloud Function calls")
        sys.exit()

    num_tasks = int(sys.argv[1])
    main(num_tasks)

Answer 1

In your cloud function, in this line:

publisher.publish(topic_path, data=data, taskID=taskID)

You are not waiting for the future that publisher.publish returns. This means there is no guarantee that the publish to the topic has actually happened when you fall off the end of the gcf_run function, yet the message on the Cloud Functions subscription to the TRIGGER topic is ACKed anyway.

Instead, to make the cloud function wait until the publish has occurred before it terminates, this should be:

publisher.publish(topic_path, data=data, taskID=taskID).result()
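The difference that .result() makes can be illustrated with standard-library futures. This is only an analogy under stated assumptions: slow_send is a hypothetical stand-in for the background network call, and ThreadPoolExecutor stands in for the batching thread inside the Pub/Sub client. It is not the Pub/Sub client itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_send(message, sent):
    """Hypothetical stand-in for the network call made in the background."""
    time.sleep(0.2)            # pretend the request takes a while
    sent.append(message)       # the message only "arrives" here
    return "id-for-" + message


sent = []
executor = ThreadPoolExecutor(max_workers=1)

# Like publisher.publish, submit returns a future right away.
future = executor.submit(slow_send, "task-7", sent)
delivered_on_return = list(sent)     # snapshot: nothing has been sent yet

# Like .result(), this blocks until the background work has finished.
message_id = future.result()
delivered_after_wait = list(sent)
executor.shutdown()

print("right after the call:", delivered_on_return)
print("after result():      ", delivered_after_wait, message_id)
```

In a local script the background thread would normally finish on its own. In a Cloud Function, work that is still pending when the function returns may never get to run, which would be consistent with a different set of taskIDs going missing on each run.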

You should also avoid bringing up and tearing down the publisher client on each function call; instead, keep the client as a global variable.
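Putting both suggestions together, main.py might look roughly like the sketch below. It is not a tested deployment: the get_publisher helper, the lazy import, and the 60-second timeout are assumptions added for illustration, and the task itself is left as a placeholder comment.

```python
# Placeholders, as in the question's main.py
project_id = "<Your Google Cloud Project ID>"
topic_name = "<RESULTS_TOPIC>"

# One publisher client per function instance, reused across invocations.
_publisher = None


def get_publisher():
    """Create the Pub/Sub publisher client on first use, then reuse it."""
    global _publisher
    if _publisher is None:
        # Imported lazily so the client is only built when first needed.
        from google.cloud import pubsub_v1
        _publisher = pubsub_v1.PublisherClient()
    return _publisher


def gcf_run(data, context):
    """Background Cloud Function triggered by Pub/Sub."""
    task_id = (data.get('attributes') or {}).get('taskID')
    if task_id is None:
        print('taskID missing!')
        return

    # ... perform the task and upload the result here ...

    publisher = get_publisher()
    topic_path = publisher.topic_path(project_id, topic_name)
    payload = u'Message number {}'.format(task_id).encode('utf-8')
    future = publisher.publish(topic_path, data=payload, taskID=task_id)

    # Block until the publish is confirmed. The timeout value is an
    # assumption; result() raises if the publish failed or timed out.
    message_id = future.result(timeout=60)
    print('Published result for taskID {} as message {}'.format(
        task_id, message_id))
```

Because result() raises on failure, a publish that does not go through should show up as an error in the function's logs rather than as a silently missing taskID on the client.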


Answer 1 (7 votes, accepted, 2019-04-04 15:02:50)
