Google Cloud PubSub: Not sending/receiving all messages from Cloud Functions

Question

Summary : My client code triggers 861 background Google Cloud Function by publishing messages to a Pub/Sub topic. Each Cloud Function performs a task, uploads results to Google Storage, and publishing messages to another Pub/Sub topic where the client code is listening. Client code does not receive all the messages although all Cloud Functions executed (verified by the number of results in Google Storage).

Server side : I have a background Google Cloud Function which is triggered each time a message is published to a TRIGGER Pub/Sub topic. The custom attributes of the message data act as function parameter depending upon which the function performs a certain task. It then upload the result to a bucket in Google Storage and publishes a message (with taskID and execution timing details) to RESULTS Pub/Sub topic (different than the one used to trigger this function).

Client side : I need to perform 861 different tasks which requires calling the Cloud Function with 861 slightly different inputs. These tasks are similar and it takes between 20 seconds to 2 minutes (median is about 1 minute) for the Cloud Function to execute them. I have created a python script for this that I run from the Google Cloud Shell (or a local machine shell). The client python script publishes 861 messages to the TRIGGER Pub/Sub topic that triggers as many Cloud Functions concurrently, each of which is passed a unique taskID in the rage [0, 860]. The client python script then polls the RESULTS Pub/Sub topic in a "synchronous pull" way for any messages. The Cloud Function, after performing the task publishes message to RESULTS Pub/Sub topic with the unique taskID and timing details. This unique taskID is used by the client to identify from which task the message is from. It also helps in identifying duplicate messages which are discarded.

Basic steps :

Client python script publishes 861 messages (each with unique taskID) to TRIGGER Pub/Sub topic and waits for result messages from the Cloud Function.
861 different Cloud Functions are called, each of which performs a task, uploads results to Google Storage, and publishes message (with taskID and execution timing details) to RESULTS Pub/Sub topic.
The client grabs all the messages synchronously and marks the task as complete.

Problem : When the client is polling for the messages from RESULTS Pub/Sub topic, I did not receive messages for all the taskID. I am sure that the Cloud Function got called and executed properly (I have 861 results in Google Storage bucket). I repeated this for a number of times and it occurred every time. Strangely, the number of missing taskID change every time as well as different taskID go missing across different runs. I am also keeping a track of number of duplicate taskID received. The number of unique taskID received, missing, and repeated are given in the table for 5 independent runs.

SN   # of Tasks  Received  Missing  Repeated
1     861          860      1        25
2     861          840      21       3
3     861          851      10       1
4     861          837      24       3
5     861          856      5        1

I am not sure where this problem might be arising from. Given the random nature of the number as well as taskIDs that go missing, I suspect there is some bug in the Pub/Sub at-least-once delivery logic. If in the Cloud Function, I sleep for a few seconds instead of performing the task, for example with time.sleep(5), then everything works just fine (I receive all 861 taskID at the client).

Code to reproduce this problem.

In the following, main.py along with requirements.txt are deployed as Google Cloud Function while client.py is the client code. Run the client with 100 concurrent tasks as python client.py 100 which repeats it 5 times. Different number of taskID go missing each time.

requirements.txt

google-cloud-pubsub

main.py

"""
This file is deployed as Google Cloud Function. This function starts,
sleeps for some seconds and pulishes back the taskID.

Deloyment:
    gcloud functions deploy gcf_run --runtime python37 --trigger-topic <TRIGGER_TOPIC> --memory=128MB --timeout=300s
"""

import time
from random import randint
from google.cloud import pubsub_v1

# Global variables
project_id = "<Your Google Cloud Project ID>"  # Your Google Cloud Project ID
topic_name = "<RESULTS_TOPIC>"  # Your Pub/Sub topic name


def gcf_run(data, context):
    """Background Cloud Function to be triggered by Pub/Sub.
    Args:
         data (dict): The dictionary with data specific to this type of event.
         context (google.cloud.functions.Context): The Cloud Functions event
         metadata.
    """

    # Message should contain taskID (in addition to the data)
    if 'attributes' in data:
        attributes = data['attributes']
        if 'taskID' in attributes:
            taskID = attributes['taskID']
        else:
            print('taskID missing!')
            return
    else:
        print('attributes missing!')
        return

    # Sleep for a random time beteen 30 seconds to 1.5 minutes
    print("Start execution for {}".format(taskID))
    sleep_time = randint(30, 90)  # sleep for this many seconds
    time.sleep(sleep_time)  # sleep for few seconds

    # Marks this task complete by publishing a message to Pub/Sub.
    data = u'Message number {}'.format(taskID)
    data = data.encode('utf-8')  # Data must be a bytestring
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_name)
    publisher.publish(topic_path, data=data, taskID=taskID)

    return

client.py

"""
The client code creates the given number of tasks and publishes to Pub/Sub,
which in turn calls the Google Cloud Functions concurrently.
Run:
    python client.py 100
"""

from __future__ import print_function
import sys
import time
from google.cloud import pubsub_v1

# Global variables
project_id = "<Google Cloud Project ID>" # Google Cloud Project ID
topic_name = "<TRIGGER_TOPIC>"    # Pub/Sub topic name to publish
subscription_name = "<subscriber to RESULTS_TOPIC>"  # Pub/Sub subscription name
num_experiments = 5  # number of times to repeat the experiment
time_between_exp = 120.0 # number of seconds between experiments

# Initialize the Publisher (to send commands that invoke Cloud Functions)
# as well as Subscriber (to receive results written by the Cloud Functions)
# Configure the batch to publish as soon as there is one kilobyte
# of data or one second has passed.
batch_settings = pubsub_v1.types.BatchSettings(
    max_bytes=1024,  # One kilobyte
    max_latency=1,   # One second
)
publisher = pubsub_v1.PublisherClient(batch_settings)
topic_path = publisher.topic_path(project_id, topic_name)

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(
    project_id, subscription_name)


class Task:
    """
    A task which will execute the Cloud Function once.

    Attributes:
        taskID (int)       : A unique number given to a task (starting from 0).
        complete (boolean) : Flag to indicate if this task has completed.
    """
    def __init__(self, taskID):
        self.taskID = taskID
        self.complete = False

    def start(self):
        """
        Start the execution of Cloud Function by publishing a message with
        taskID to the Pub/Sub topic.
        """
        data = u'Message number {}'.format(self.taskID)
        data = data.encode('utf-8')  # Data must be a bytestring
        publisher.publish(topic_path, data=data, taskID=str(self.taskID))

    def end(self):
        """
        Mark the end of this task.
            Returns (boolean):
                True if normal, False if task was already marked before.
        """
        # If this task was not complete, mark it as completed
        if not self.complete:
            self.complete = True
            return True

        return False
    # [END of Task Class]


def createTasks(num_tasks):
    """
    Create a list of tasks and return it.
        Args:
            num_tasks (int) : Number of tasks (Cloud Function calls)
        Returns (list):
            A list of tasks.
    """
    all_tasks = list()
    for taskID in range(0, num_tasks):
        all_tasks.append(Task(taskID=taskID))

    return all_tasks


def receiveResults(all_tasks):
    """
    Receives messages from the Pub/Sub subscription. I am using a blocking
    Synchronous Pull instead of the usual asynchronous pull with a callback
    funtion as I rely on a polling pattern to retrieve messages.
    See: https://cloud.google.com/pubsub/docs/pull
        Args:
            all_tasks (list) : List of all tasks.
    """
    num_tasks = len(all_tasks)
    total_msg_received = 0  # track the number of messages received
    NUM_MESSAGES = 10  # maximum number of messages to pull synchronously
    TIMEOUT = 600.0    # number of seconds to wait for response (10 minutes)

    # Keep track of elapsed time and exit if > TIMEOUT
    __MyFuncStartTime = time.time()
    __MyFuncElapsedTime = 0.0

    print('Listening for messages on {}'.format(subscription_path))
    while (total_msg_received < num_tasks) and (__MyFuncElapsedTime < TIMEOUT):
        # The subscriber pulls a specific number of messages.
        response = subscriber.pull(subscription_path,
            max_messages=NUM_MESSAGES, timeout=TIMEOUT, retry=None)
        ack_ids = []

        # Keep track of all received messages
        for received_message in response.received_messages:
            if received_message.message.attributes:
                attributes = received_message.message.attributes
                taskID = int(attributes['taskID'])
                if all_tasks[taskID].end():
                    # increment count only if task completes the first time
                    # if False, we received a duplicate message
                    total_msg_received += 1
                #     print("Received taskID = {} ({} of {})".format(
                #         taskID, total_msg_received, num_tasks))
                # else:
                #     print('REPEATED: taskID {} was already marked'.format(taskID))
            else:
                print('attributes missing!')

            ack_ids.append(received_message.ack_id)

        # Acknowledges the received messages so they will not be sent again.
        if ack_ids:
            subscriber.acknowledge(subscription_path, ack_ids)

        time.sleep(0.2)  # Wait 200 ms before polling again
        __MyFuncElapsedTime = time.time() - __MyFuncStartTime
        # print("{} s elapsed. Listening again.".format(__MyFuncElapsedTime))

    # if total_msg_received != num_tasks, function exit due to timeout
    if total_msg_received != num_tasks:
        print("WARNING: *** Receiver timed out! ***")
    print("Received {} messages out of {}. Done.".format(
        total_msg_received, num_tasks))


def main(num_tasks):
    """
    Main execution point of the program
    """

    for experiment_num in range(1, num_experiments + 1):
        print("Starting experiment {} of {} with {} tasks".format(
            experiment_num, num_experiments, num_tasks))
        # Create all tasks and start them
        all_tasks = createTasks(num_tasks)
        for task in all_tasks:     # Start all tasks
            task.start()
        print("Published {} taskIDs".format(num_tasks))

        receiveResults(all_tasks)  # Receive message from Pub/Sub subscription

        print("Waiting {} seconds\n\n".format(time_between_exp))
        time.sleep(time_between_exp)  # sleep between experiments


if __name__ == "__main__":
    if(len(sys.argv) != 2):
        print("usage: python client.py  <num_tasks>")
        print("    num_tasks: Number of concurrent Cloud Function calls")
        sys.exit()

    num_tasks = int(sys.argv[1])
    main(num_tasks)

Answer 1

In your cloud function, in this line:

publisher.publish(topic_path, data=data, taskID=taskID)

You are not waiting for the future that publisher.publish returns. This means you cannot be guaranteed that the publish onto the topic has actually happened when you fall off the end of the gcf_run function, but the message on the TRIGGER topic cloud functions subscription is ACK-ed anyway.

Instead, to wait until the publish occurs for the cloud function to terminate, this should be:

publisher.publish(topic_path, data=data, taskID=taskID).result()

You should also avoid bringing up and tearing down the publisher client on each function call, instead having the client as a global variable.

Google Cloud PubSub: Not sending/receiving all messages from Cloud Functions

Question

1 answers

solution1
7 ACCPTED 2019-04-04 15:02:50

Google Cloud PubSub: Not sending/receiving all messages from Cloud Functions

Question

1 answers

solution1 7 ACCPTED 2019-04-04 15:02:50

solution1
7 ACCPTED 2019-04-04 15:02:50