
GCP - Message from PubSub to BigQuery

I need to get the data from my Pub/Sub message and insert it into BigQuery.

What I have:

const topicName = "-----topic-name-----";
const data = JSON.stringify({ foo: "bar" });

// Imports the Google Cloud client library
const { PubSub } = require("@google-cloud/pubsub");

// Creates a client; cache this for further use
const pubSubClient = new PubSub();

async function publishMessageWithCustomAttributes() {
  // Publishes the message as a string, e.g. "Hello, world!" or JSON.stringify(someObject)
  const dataBuffer = Buffer.from(data);

  // Add two custom attributes, origin and username, to the message
  const customAttributes = {
    origin: "nodejs-sample",
    username: "gcp",
  };

  const messageId = await pubSubClient
    .topic(topicName)
    .publish(dataBuffer, customAttributes);
  console.log(`Message ${messageId} published.`);
}

publishMessageWithCustomAttributes().catch(console.error);

I need to get the data/attributes from this message and query them in BigQuery. Can anyone help me?

Thanks in advance!

There are 2 ways to consume the messages: one message at a time, or in bulk.

First, before going into detail: because you will be making BigQuery calls (or Facebook API calls), most of the processing time will be spent waiting for the API response.


  • Message per message: if the message volume is acceptable, you can process the messages one at a time. You have 2 solutions here:
  1. You can handle each message with Cloud Functions. Set the function to the minimum amount of memory (128 MB) to limit the CPU cost and therefore the overall cost. Because you will mostly be waiting, there is no point paying for expensive CPU that does nothing. The data will be processed more slowly when it arrives, but that's the tradeoff.

     Create a Cloud Function triggered by the topic, or a push subscription that calls an HTTP-triggered Cloud Function.

  2. You can also handle requests concurrently with Cloud Run. Cloud Run can handle up to 250 requests concurrently (in preview), and because you will wait a lot, it's well suited to this workload. If you need more CPU and memory, you can increase these values to 4 CPUs and 8 GB of memory. It's my preferred solution.
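As a rough sketch of the Cloud Functions option, the handler below decodes the message, merges the JSON payload and the attributes into one row, and streams that row into a table. It assumes the payload is a flat JSON object, as in the question, and that the destination table has matching columns (`foo`, `origin`, `username`). The names `messageToRow`, `makeHandler`, `my_dataset` and `my_table` are hypothetical. The table object is passed in so the logic can be tried without credentials.

```javascript
// Builds one BigQuery row from a Pub/Sub message as delivered to a
// Pub/Sub-triggered function (message.data is base64-encoded).
// Assumption: the payload is a flat JSON object.
function messageToRow(message) {
  const payload = JSON.parse(Buffer.from(message.data, "base64").toString());
  return { ...payload, ...(message.attributes || {}) };
}

// Returns a function handler that inserts each message into `table`.
// `table` is expected to expose insert(rows), like the Table object of
// the @google-cloud/bigquery client.
function makeHandler(table) {
  return async (message, context) => {
    await table.insert([messageToRow(message)]);
  };
}

// Wiring for deployment (hypothetical dataset and table names):
// const { BigQuery } = require("@google-cloud/bigquery");
// const table = new BigQuery().dataset("my_dataset").table("my_table");
// exports.pubsubToBigQuery = makeHandler(table);
```

If the insert fails, the returned promise rejects and the invocation is reported as an error. As far as I know, the message is only retried if retries are enabled on the function.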

  • Bulk processing is possible if you are comfortable with multi-CPU, multi-(light)thread development. It's easy in Go. Concurrency in Node is also easy (async/await), but I don't know whether it can use multiple CPUs or only one. Anyway, the principle is the following:
  1. Create a pull subscription on the Pub/Sub topic.
  2. Create a Cloud Run service (better for multi-CPU, but this also works with App Engine or Cloud Functions) that listens to the pull subscription for a while (let's say 10 minutes).
  3. For each message pulled, run an async process: get the data/attributes, make the call to BigQuery, ack the message.
  4. When the pull connection times out, stop listening for messages, finish processing the in-flight messages, and exit gracefully (return an HTTP 200 code).
  5. Create a Cloud Scheduler job that calls the Cloud Run service every 10 minutes. Set the timeout to 15 minutes and disable retries.
  6. Deploy the Cloud Run service with a timeout of 15 minutes.

This solution offers better message throughput (you can process more than 250 messages per Cloud Run service), but it has no real advantage because you are still limited by the API call latency.
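Steps 2 to 4 of the bulk approach could look roughly like the sketch below. It assumes `subscription` behaves like the object returned by `new PubSub().subscription(name)` and `table` like the one returned by `new BigQuery().dataset(d).table(t)`, and that payloads are flat JSON objects. `pullForAWhile` is a hypothetical name, not something provided by either library.

```javascript
// Listens to a pull subscription for `durationMs`, inserts every message
// into BigQuery, then stops and waits for the in-flight messages.
// Resolves with the number of messages that were inserted and acked.
function pullForAWhile(subscription, table, durationMs) {
  return new Promise((resolve, reject) => {
    const inFlight = new Set();
    let processed = 0;

    const onMessage = (message) => {
      // With the pull client, message.data is already a Buffer (no base64).
      const work = (async () => {
        const payload = JSON.parse(message.data.toString());
        await table.insert([{ ...payload, ...message.attributes }]);
        message.ack();
        processed += 1;
      })()
        .catch((err) => {
          console.error(err);
          message.nack(); // let Pub/Sub redeliver the message
        })
        .finally(() => inFlight.delete(work));
      inFlight.add(work);
    };

    subscription.on("message", onMessage);
    subscription.on("error", reject);

    setTimeout(async () => {
      try {
        subscription.removeListener("message", onMessage);
        await Promise.all(inFlight); // finish the current messages
        await subscription.close();
        resolve(processed);
      } catch (err) {
        reject(err);
      }
    }, durationMs);
  });
}
```

In the Cloud Run service, the HTTP handler called by Cloud Scheduler would await something like `pullForAWhile(subscription, table, 10 * 60 * 1000)` and then return a 200 response.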


EDIT 1

Code sample

// For a Pub/Sub-triggered function
exports.logMessageTopic = (message, context) => {
    console.log("Message Content")
    console.log(Buffer.from(message.data, 'base64').toString())
    console.log("Attribute list")
    for (let key in message.attributes) {
        console.log(key + " -> " + message.attributes[key]);
    };
};


// For a push subscription (HTTP-triggered function)
exports.logMessagePush = (req, res) => {
    console.log("Message Content")
    console.log(Buffer.from(req.body.message.data, 'base64').toString())
    console.log("Attribute list")
    for (let key in req.body.message.attributes) {
        console.log(key + " -> " + req.body.message.attributes[key]);
    }
    // Return a success status so Pub/Sub treats the message as acknowledged
    res.status(204).send();
};
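To go from logging to actually writing into BigQuery with a push subscription, a handler along these lines might work. With push delivery the HTTP status is the acknowledgement: a success status acks the message, and an error status makes Pub/Sub deliver it again. This is only a sketch: `makePushHandler` is a hypothetical name, `table` is assumed to expose `insert(rows)` like the Table object of the `@google-cloud/bigquery` client, and the payload is assumed to be a flat JSON object whose keys match the table columns.

```javascript
// HTTP handler for a push subscription that writes the message to BigQuery.
// `table` is injected so the handler can be exercised without credentials.
function makePushHandler(table) {
  return async (req, res) => {
    try {
      const { data, attributes } = req.body.message;
      const payload = JSON.parse(Buffer.from(data, "base64").toString());
      await table.insert([{ ...payload, ...attributes }]);
      res.status(204).send(); // success status: Pub/Sub acks the message
    } catch (err) {
      console.error(err);
      res.status(500).send(); // error status: Pub/Sub will redeliver
    }
  };
}
```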

