Google Cloud PubSub: Not sending/receiving all messages from Cloud Functions
Summary: My client code triggers 861 background Google Cloud Functions by publishing messages to a Pub/Sub topic. Each Cloud Function performs a task, uploads results to Google Storage, and publishes a message to another Pub/Sub topic where the client code is listening. The client code does not receive all the messages, although all Cloud Functions executed (verified by the number of results in Google Storage).
Server side: I have a background Google Cloud Function which is triggered each time a message is published to a TRIGGER Pub/Sub topic. The custom attributes of the message act as function parameters that determine which task the function performs. It then uploads the result to a bucket in Google Storage and publishes a message (with taskID and execution timing details) to a RESULTS Pub/Sub topic (different from the one used to trigger this function).
Client side: I need to perform 861 different tasks, which requires calling the Cloud Function with 861 slightly different inputs. These tasks are similar, and the Cloud Function takes between 20 seconds and 2 minutes (median about 1 minute) to execute each of them. I have created a Python script for this that I run from the Google Cloud Shell (or a local machine shell). The client Python script publishes 861 messages to the TRIGGER Pub/Sub topic, which triggers as many Cloud Functions concurrently, each of which is passed a unique taskID in the range [0, 860]. The client Python script then polls the RESULTS Pub/Sub topic in a "synchronous pull" way for any messages. The Cloud Function, after performing the task, publishes a message to the RESULTS Pub/Sub topic with the unique taskID and timing details. The client uses this unique taskID to identify which task a message came from. It also helps in identifying duplicate messages, which are discarded.
Basic steps:
Problem: When the client polls for messages from the RESULTS Pub/Sub topic, it does not receive messages for all the taskIDs. I am sure that the Cloud Function got called and executed properly (I have 861 results in the Google Storage bucket). I repeated this a number of times and it occurred every time. Strangely, the number of missing taskIDs changes every time, and different taskIDs go missing across different runs. I am also keeping track of the number of duplicate taskIDs received. The numbers of unique taskIDs received, missing, and repeated are given in the table for 5 independent runs.
SN   # of Tasks   Received   Missing   Repeated
1    861          860        1         25
2    861          840        21        3
3    861          851        10        1
4    861          837        24        3
5    861          856        5         1
I am not sure where this problem might be arising from. Given the random nature of the number of missing taskIDs, as well as of which taskIDs go missing, I suspect there is some bug in the Pub/Sub at-least-once delivery logic. If, in the Cloud Function, I sleep for a few seconds instead of performing the task, for example with time.sleep(5), then everything works just fine (I receive all 861 taskIDs at the client).
Code to reproduce this problem.
In the following, main.py along with requirements.txt are deployed as a Google Cloud Function, while client.py is the client code. Run the client with 100 concurrent tasks as python client.py 100, which repeats the experiment 5 times. A different number of taskIDs goes missing each time.
requirements.txt
google-cloud-pubsub
main.py
"""
This file is deployed as Google Cloud Function. This function starts,
sleeps for some seconds and publishes back the taskID.
Deployment:
    gcloud functions deploy gcf_run --runtime python37 --trigger-topic <TRIGGER_TOPIC> --memory=128MB --timeout=300s
"""
import time
from random import randint

from google.cloud import pubsub_v1

# Global variables
project_id = "<Your Google Cloud Project ID>"  # Your Google Cloud Project ID
topic_name = "<RESULTS_TOPIC>"  # Your Pub/Sub topic name


def gcf_run(data, context):
    """Background Cloud Function to be triggered by Pub/Sub.
    Args:
        data (dict): The dictionary with data specific to this type of event.
        context (google.cloud.functions.Context): The Cloud Functions event
            metadata.
    """
    # Message should contain taskID (in addition to the data)
    if 'attributes' in data:
        attributes = data['attributes']
        if 'taskID' in attributes:
            taskID = attributes['taskID']
        else:
            print('taskID missing!')
            return
    else:
        print('attributes missing!')
        return

    # Sleep for a random time between 30 seconds and 1.5 minutes
    print("Start execution for {}".format(taskID))
    sleep_time = randint(30, 90)  # sleep for this many seconds
    time.sleep(sleep_time)  # sleep for few seconds

    # Marks this task complete by publishing a message to Pub/Sub.
    data = u'Message number {}'.format(taskID)
    data = data.encode('utf-8')  # Data must be a bytestring
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_name)
    publisher.publish(topic_path, data=data, taskID=taskID)
    return
client.py
"""
The client code creates the given number of tasks and publishes to Pub/Sub,
which in turn calls the Google Cloud Functions concurrently.
Run:
    python client.py 100
"""
from __future__ import print_function
import sys
import time

from google.cloud import pubsub_v1

# Global variables
project_id = "<Google Cloud Project ID>"  # Google Cloud Project ID
topic_name = "<TRIGGER_TOPIC>"  # Pub/Sub topic name to publish
subscription_name = "<subscriber to RESULTS_TOPIC>"  # Pub/Sub subscription name
num_experiments = 5  # number of times to repeat the experiment
time_between_exp = 120.0  # number of seconds between experiments

# Initialize the Publisher (to send commands that invoke Cloud Functions)
# as well as Subscriber (to receive results written by the Cloud Functions)
# Configure the batch to publish as soon as there is one kilobyte
# of data or one second has passed.
batch_settings = pubsub_v1.types.BatchSettings(
    max_bytes=1024,  # One kilobyte
    max_latency=1,  # One second
)
publisher = pubsub_v1.PublisherClient(batch_settings)
topic_path = publisher.topic_path(project_id, topic_name)
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(
    project_id, subscription_name)

class Task:
    """
    A task which will execute the Cloud Function once.
    Attributes:
        taskID (int) : A unique number given to a task (starting from 0).
        complete (boolean) : Flag to indicate if this task has completed.
    """
    def __init__(self, taskID):
        self.taskID = taskID
        self.complete = False

    def start(self):
        """
        Start the execution of Cloud Function by publishing a message with
        taskID to the Pub/Sub topic.
        """
        data = u'Message number {}'.format(self.taskID)
        data = data.encode('utf-8')  # Data must be a bytestring
        publisher.publish(topic_path, data=data, taskID=str(self.taskID))

    def end(self):
        """
        Mark the end of this task.
        Returns (boolean):
            True if normal, False if task was already marked before.
        """
        # If this task was not complete, mark it as completed
        if not self.complete:
            self.complete = True
            return True
        return False
# [END of Task Class]

def createTasks(num_tasks):
    """
    Create a list of tasks and return it.
    Args:
        num_tasks (int) : Number of tasks (Cloud Function calls)
    Returns (list):
        A list of tasks.
    """
    all_tasks = list()
    for taskID in range(0, num_tasks):
        all_tasks.append(Task(taskID=taskID))
    return all_tasks

def receiveResults(all_tasks):
    """
    Receives messages from the Pub/Sub subscription. I am using a blocking
    Synchronous Pull instead of the usual asynchronous pull with a callback
    function as I rely on a polling pattern to retrieve messages.
    See: https://cloud.google.com/pubsub/docs/pull
    Args:
        all_tasks (list) : List of all tasks.
    """
    num_tasks = len(all_tasks)
    total_msg_received = 0  # track the number of messages received
    NUM_MESSAGES = 10  # maximum number of messages to pull synchronously
    TIMEOUT = 600.0  # number of seconds to wait for response (10 minutes)

    # Keep track of elapsed time and exit if > TIMEOUT
    __MyFuncStartTime = time.time()
    __MyFuncElapsedTime = 0.0

    print('Listening for messages on {}'.format(subscription_path))
    while (total_msg_received < num_tasks) and (__MyFuncElapsedTime < TIMEOUT):
        # The subscriber pulls a specific number of messages.
        response = subscriber.pull(subscription_path,
            max_messages=NUM_MESSAGES, timeout=TIMEOUT, retry=None)
        ack_ids = []

        # Keep track of all received messages
        for received_message in response.received_messages:
            if received_message.message.attributes:
                attributes = received_message.message.attributes
                taskID = int(attributes['taskID'])
                if all_tasks[taskID].end():
                    # increment count only if task completes the first time
                    # if False, we received a duplicate message
                    total_msg_received += 1
                #     print("Received taskID = {} ({} of {})".format(
                #         taskID, total_msg_received, num_tasks))
                # else:
                #     print('REPEATED: taskID {} was already marked'.format(taskID))
            else:
                print('attributes missing!')
            ack_ids.append(received_message.ack_id)

        # Acknowledges the received messages so they will not be sent again.
        if ack_ids:
            subscriber.acknowledge(subscription_path, ack_ids)

        time.sleep(0.2)  # Wait 200 ms before polling again
        __MyFuncElapsedTime = time.time() - __MyFuncStartTime
        # print("{} s elapsed. Listening again.".format(__MyFuncElapsedTime))

    # if total_msg_received != num_tasks, function exit due to timeout
    if total_msg_received != num_tasks:
        print("WARNING: *** Receiver timed out! ***")
    print("Received {} messages out of {}. Done.".format(
        total_msg_received, num_tasks))

def main(num_tasks):
    """
    Main execution point of the program
    """
    for experiment_num in range(1, num_experiments + 1):
        print("Starting experiment {} of {} with {} tasks".format(
            experiment_num, num_experiments, num_tasks))
        # Create all tasks and start them
        all_tasks = createTasks(num_tasks)
        for task in all_tasks:  # Start all tasks
            task.start()
        print("Published {} taskIDs".format(num_tasks))
        receiveResults(all_tasks)  # Receive message from Pub/Sub subscription
        print("Waiting {} seconds\n\n".format(time_between_exp))
        time.sleep(time_between_exp)  # sleep between experiments


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: python client.py <num_tasks>")
        print("    num_tasks: Number of concurrent Cloud Function calls")
        sys.exit()
    num_tasks = int(sys.argv[1])
    main(num_tasks)
In your cloud function, in this line:
publisher.publish(topic_path, data=data, taskID=taskID)
You are not waiting for the future that publisher.publish returns. This means there is no guarantee that the publish onto the topic has actually happened when you fall off the end of the gcf_run function, but the message on the TRIGGER topic's Cloud Functions subscription is ACK-ed anyway.
Instead, to make the cloud function wait until the publish has occurred before it terminates, this should be:
publisher.publish(topic_path, data=data, taskID=taskID).result()
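To illustrate why the missing .result() matters, here is a small standard-library simulation. It is only a sketch: FakePublisher, its latency parameter, and the two helper functions are hypothetical stand-ins, not the Pub/Sub client. It assumes only that publish hands the message to a background thread and returns a future, which is the behaviour described above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class FakePublisher:
    """Hypothetical stand-in for a batching publisher.

    Delivery happens later, on a background thread, and publish()
    returns a future immediately.
    """

    def __init__(self, latency=0.2):
        self.latency = latency
        self.delivered = []
        self._pool = ThreadPoolExecutor(max_workers=1)

    def publish(self, task_id):
        def send():
            time.sleep(self.latency)  # simulated batching and network delay
            self.delivered.append(task_id)
            return task_id
        return self._pool.submit(send)


def fire_and_forget(publisher, task_id):
    """Return right after publish(); report how many messages were delivered."""
    publisher.publish(task_id)
    return len(publisher.delivered)


def wait_for_publish(publisher, task_id):
    """Block on the future before returning; report how many were delivered."""
    publisher.publish(task_id).result()
    return len(publisher.delivered)
```

In this simulation the fire-and-forget variant returns before anything has been delivered. In a Cloud Function, returning at that point means the pending publish may never complete, which would be consistent with a varying number of taskIDs going missing.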
You should also avoid bringing up and tearing down the publisher client on each function call; instead, keep the client as a global variable.
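Putting both suggestions together, a minimal sketch of main.py could look like the following. The helper names get_publisher and publish_result, the optional publisher argument, and the 60-second timeout are assumptions made for this example and do not appear in the original code. The client is created lazily so that it is built once per function instance and reused across invocations.

```python
"""Sketch of main.py that waits for the publish future before returning."""

PROJECT_ID = "<Your Google Cloud Project ID>"
TOPIC_NAME = "<RESULTS_TOPIC>"

_publisher = None  # created once per function instance, then reused


def get_publisher():
    """Lazily create the Pub/Sub publisher client and cache it globally."""
    global _publisher
    if _publisher is None:
        # Imported here so the client is only built on first use
        from google.cloud import pubsub_v1
        _publisher = pubsub_v1.PublisherClient()
    return _publisher


def publish_result(task_id, publisher=None, timeout=60):
    """Publish the completion message and block until it is confirmed.

    `publisher` and `timeout` are hypothetical parameters for this sketch;
    passing a publisher makes the function easy to exercise with a fake.
    """
    if publisher is None:
        publisher = get_publisher()
    topic_path = publisher.topic_path(PROJECT_ID, TOPIC_NAME)
    data = u'Message number {}'.format(task_id).encode('utf-8')
    future = publisher.publish(topic_path, data=data, taskID=str(task_id))
    # result() returns the message ID, or raises if the publish failed
    return future.result(timeout=timeout)


def gcf_run(data, context):
    """Background Cloud Function triggered by Pub/Sub."""
    task_id = (data.get('attributes') or {}).get('taskID')
    if task_id is None:
        print('taskID missing!')
        return
    # ... perform the actual task here ...
    message_id = publish_result(task_id)
    print("Published result for {} as message {}".format(task_id, message_id))
```

Because result() raises when the publish fails, an error now surfaces as a failed invocation instead of a silently missing message.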