GCP - Message from PubSub to BigQuery

I need to get the data from my Pub/Sub message and insert it into BigQuery.

What I have:

const topicName = "-----topic-name-----";
const data = JSON.stringify({ foo: "bar" });

// Imports the Google Cloud client library
const { PubSub } = require("@google-cloud/pubsub");

// Creates a client; cache this for further use
const pubSubClient = new PubSub();

async function publishMessageWithCustomAttributes() {
  // Publishes the message as a string, e.g. "Hello, world!" or JSON.stringify(someObject)
  const dataBuffer = Buffer.from(data);

  // Add two custom attributes, origin and username, to the message
  const customAttributes = {
    origin: "nodejs-sample",
    username: "gcp",
  };

  const messageId = await pubSubClient
    .topic(topicName)
    .publish(dataBuffer, customAttributes);
  console.log(`Message ${messageId} published.`);
}

publishMessageWithCustomAttributes().catch(console.error);

I need to get the data/attributes from this message and query them in BigQuery. Can anyone help me?

Thanks in advance!

In fact, there are two solutions to consume the messages: either message by message, or in bulk.

First, before going into detail: because you will perform BigQuery calls (or Facebook API calls), you will spend a lot of the processing time waiting for the API response.


  • Message per message: If you have an acceptable volume of messages, you can process them one by one. You have two options here:
  1. You can handle each message with Cloud Functions. Set the minimum amount of memory for the function (128 MB) to limit the CPU cost and thus the overall cost. Indeed, because you will spend most of the time waiting, don't pay for an expensive CPU to do nothing; you will process the data more slowly when it arrives, but it's a tradeoff.

  Create a Cloud Function on the topic, or a push subscription that calls an HTTP-triggered Cloud Function.

  2. You can also handle requests concurrently with Cloud Run. Cloud Run can handle up to 250 concurrent requests (in preview), and because you will spend most of the time waiting, it's perfectly suitable. If you need more CPU and memory, you can increase these values to 4 CPUs and 8 GB of memory. It's my preferred solution.

  • Bulk processing is possible if you can easily manage multi-CPU, multi-(light)thread development. It's easy in Go. Concurrency in Node is also easy (await/async), but I don't know whether it's multi-CPU capable or single-CPU only. Anyway, the principle is the following (see the sketch below):
  1. Create a pull subscription on the PubSub topic.
  2. Create a Cloud Run service (better for multi-CPU, but this also works with App Engine or Cloud Functions) that listens to the pull subscription for a while (let's say 10 minutes).
  3. For each message pulled, perform an async process: get the data/attributes, make the call to BigQuery, and ack the message.
  4. After the timeout of the pull connection, stop listening for messages, finish processing the in-flight messages, and exit gracefully (return a 200 HTTP code).
  5. Create a Cloud Scheduler job that calls the Cloud Run service every 10 minutes. Set the timeout to 15 minutes and discard retries.
  6. Deploy the Cloud Run service with a timeout of 15 minutes.

This solution offers better message throughput (you can process more than 250 messages per Cloud Run service), but it doesn't bring a real advantage here because you are limited by the API call latency.
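Here is a minimal Node.js sketch of steps 2-4, based on the standard @google-cloud/pubsub streaming-pull pattern. The subscription name "my-pull-subscription", the 10-minute listening window, and the payload shape are assumptions; the BigQuery call itself is shown in the sample at the end of this answer.

// Minimal sketch of the bulk approach (steps 2-4), assuming a pull
// subscription named "my-pull-subscription" already exists.
const { PubSub } = require("@google-cloud/pubsub");

const pubSubClient = new PubSub();
const LISTEN_DURATION_MS = 10 * 60 * 1000; // listen for ~10 minutes

function listenForMessages() {
  const subscription = pubSubClient.subscription("my-pull-subscription");

  const messageHandler = async (message) => {
    // Get the data and the attributes from the pulled message
    // (with pull, message.data is a Buffer, not a base64 string)
    const payload = JSON.parse(message.data.toString());
    console.log(`Received ${message.id}:`, payload, message.attributes);

    // ... make the BigQuery call here (see the sample at the end) ...

    // Ack only after the processing succeeded
    message.ack();
  };

  subscription.on("message", messageHandler);

  // After the timeout, stop listening and let in-flight handlers finish
  setTimeout(() => {
    subscription.removeListener("message", messageHandler);
    console.log("Stopped listening; exiting gracefully.");
  }, LISTEN_DURATION_MS);
}

listenForMessages();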


EDIT 1

Code sample

// For a Pub/Sub-triggered function
exports.logMessageTopic = (message, context) => {
    console.log("Message Content");
    console.log(Buffer.from(message.data, 'base64').toString());
    console.log("Attribute list");
    for (let key in message.attributes) {
        console.log(key + " -> " + message.attributes[key]);
    }
};


// For a push subscription (HTTP-triggered function)
exports.logMessagePush = (req, res) => {
    console.log("Message Content");
    console.log(Buffer.from(req.body.message.data, 'base64').toString());
    console.log("Attribute list");
    for (let key in req.body.message.attributes) {
        console.log(key + " -> " + req.body.message.attributes[key]);
    }
    // Return a 2xx code to acknowledge the message; otherwise Pub/Sub retries it
    res.status(200).send();
};
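And to actually insert the decoded message into BigQuery, which was the original question, here is a minimal sketch using the @google-cloud/bigquery streaming-insert API. The dataset name my_dataset, the table my_table, and the row layout are assumptions; adapt them to your project, and note that the table must already exist with matching columns.

// Sketch: decode a Cloud Functions Pub/Sub event and stream it into
// BigQuery. Dataset/table names and the row layout are hypothetical.
const { BigQuery } = require("@google-cloud/bigquery");

const bigquery = new BigQuery();

exports.pubSubToBigQuery = async (message, context) => {
  // The event payload is base64-encoded, as in the samples above
  const payload = JSON.parse(Buffer.from(message.data, 'base64').toString());

  // One row combining the data and the custom attributes
  const row = {
    foo: payload.foo,                      // from the published JSON
    origin: message.attributes.origin,     // custom attribute
    username: message.attributes.username, // custom attribute
  };

  // Streaming insert; the table must already exist with matching columns
  await bigquery.dataset("my_dataset").table("my_table").insert([row]);
  console.log("Row inserted into BigQuery");
};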
