
Fastest way to ingest data from BigQuery to PubSub

At the moment I am going through the GCP docs trying to figure out the optimal/fastest way to ingest data from BigQuery into Pub/Sub using Python. What I am doing so far (in a simplified form) is:

import json

from google.cloud import pubsub_v1

MESSAGE_SIZE_IN_BYTES = 500
MAX_BATCH_MESSAGES = 20
MAX_BYTES_BATCH = MESSAGE_SIZE_IN_BYTES * MAX_BATCH_MESSAGES
BATCH_MAX_LATENCY_IN_10MS = 0.01  # seconds, i.e. 10 ms
MAX_FLOW_MESSAGES = 20
MAX_FLOW_BYTES = MESSAGE_SIZE_IN_BYTES * MAX_FLOW_MESSAGES

batch_settings = pubsub_v1.types.BatchSettings(
    max_messages=MAX_BATCH_MESSAGES,
    max_bytes=MAX_BYTES_BATCH,
    max_latency=BATCH_MAX_LATENCY_IN_10MS,
)
publisher_options = pubsub_v1.types.PublisherOptions(
    flow_control=pubsub_v1.types.PublishFlowControl(
        message_limit=MAX_FLOW_MESSAGES,
        byte_limit=MAX_FLOW_BYTES,
        limit_exceeded_behavior=pubsub_v1.types.LimitExceededBehavior.BLOCK,
    ),
)
pubsub_client = pubsub_v1.PublisherClient(
    credentials=credentials,
    batch_settings=batch_settings,
    publisher_options=publisher_options,
)

bigquery_client = ....

bq_query_job = bigquery_client.query(QUERY)
rows = bq_query_job.result()

publish_futures = []
for row in rows:
    callback_obj = PubsubCallback(...)
    # Convert the BigQuery Row to a dict before serialising it to JSON.
    json_data = json.dumps(dict(row), default=str).encode("utf-8")
    publish_future = pubsub_client.publish(topic_path, json_data)
    publish_future.add_done_callback(callback_obj.callback)
    publish_futures.append(publish_future)

So that's one message per row. I have been trying to tweak different parameters for the Pub/Sub publisher client, but I cannot get past 20-30 messages (rows) per second. Is there a way to read from BigQuery and publish to Pub/Sub faster (at least 1000 times faster than now)?

We also have a need to get data from BigQuery into Pub/Sub, and we do so using Dataflow. I've just looked at one of the jobs we ran today: we loaded 3.4 million rows in about 5 minutes (so ~11,000 rows per second).

Our Dataflow jobs are written in Java, but you could write them in Python if you wish. Here is the code for the pipeline I described above:

package com.ourcompany.pipelines;

import com.google.api.services.bigquery.model.TableRow;
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation.Required;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@code BigQueryEventReplayer} pipeline runs a supplied SQL query
 * against BigQuery and sends the results one-by-one to Pub/Sub.
 * The query MUST return a column named 'json'; it is this column
 * (and ONLY this column) that will be sent onward. The column must be a String
 * type and should contain valid JSON.
 */
public class BigQueryEventReplayer {

  private static final Logger logger = LoggerFactory.getLogger(BigQueryEventReplayer.class);

  /**
   * Options for the BigQueryEventReplayer. See descriptions for more info
   */
  public interface Options extends PipelineOptions {
    @Description("SQL query to be run."
        + "An SQL string literal which will be run 'as is'")
    @Required
    ValueProvider<String> getBigQuerySql();

    void setBigQuerySql(ValueProvider<String> value);

    @Description("The name of the topic which data should be published to. "
        + "The name should be in the format of projects/<project-id>/topics/<topic-name>.")
    @Required
    ValueProvider<String> getOutputTopic();

    void setOutputTopic(ValueProvider<String> value);

    @Description("The ID of the BigQuery dataset targeted by the event")
    @Required
    ValueProvider<String> getBigQueryTargetDataset();

    void setBigQueryTargetDataset(ValueProvider<String> value);

    @Description("The ID of the BigQuery table targeted by the event")
    @Required
    ValueProvider<String> getBigQueryTargetTable();

    void setBigQueryTargetTable(ValueProvider<String> value);

    @Description("The SourceSystem attribute of the event")
    @Required
    ValueProvider<String> getSourceSystem();

    void setSourceSystem(ValueProvider<String> value);

  }

  /**
   * Takes the data from the TableRow and prepares it for the PubSub, including
   * adding attributes to ensure the payload is routed correctly.
   */
  public static class MapQueryToPubsub extends DoFn<TableRow, PubsubMessage> {
    private final ValueProvider<String> targetDataset;
    private final ValueProvider<String> targetTable;
    private final ValueProvider<String> sourceSystem;

    MapQueryToPubsub(
        ValueProvider<String> targetDataset, 
        ValueProvider<String> targetTable, 
        ValueProvider<String> sourceSystem) {
      this.targetDataset = targetDataset;
      this.targetTable = targetTable;
      this.sourceSystem = sourceSystem;
    }

    /**
     * Entry point of DoFn for Dataflow.
     */
    @ProcessElement
    public void processElement(ProcessContext c) {
      TableRow row = c.element();
      if (!row.containsKey("json")) {
        logger.warn("table does not contain column named 'json'; skipping row");
        return;
      }
      Map<String, String> attributes = new HashMap<>();
      attributes.put("sourceSystem", sourceSystem.get());
      attributes.put("targetDataset", targetDataset.get());
      attributes.put("targetTable", targetTable.get());
      String json = (String) row.get("json");
      c.output(new PubsubMessage(json.getBytes(), attributes));
    }
  }

  /**
   * Run the pipeline. This is the entrypoint for running 'locally'
   */
  public static void main(String[] args) {
    // Parse the user options passed from the command-line
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    run(options);
  }

  /**
   * Run the pipeline. This is the entrypoint that GCP will use
   */
  public static PipelineResult run(Options options) {

    Pipeline pipeline = Pipeline.create(options);

    pipeline.apply("Read from BigQuery query",
        BigQueryIO.readTableRows().fromQuery(options.getBigQuerySql()).usingStandardSql().withoutValidation()
            .withTemplateCompatibility())
        .apply("Map data to PubsubMessage",
            ParDo.of(
                new MapQueryToPubsub(
                    options.getBigQueryTargetDataset(),
                    options.getBigQueryTargetTable(),
                    options.getSourceSystem()
                )
            )
        )
        .apply("Write message to PubSub", PubsubIO.writeMessages().to(options.getOutputTopic()));

    return pipeline.run();
  }
}

This pipeline requires that each row retrieved from BigQuery is a JSON document, something that can easily be achieved using TO_JSON_STRING.
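For example, a query along these lines (the project, dataset, and table names are placeholders) returns each row wrapped into a single column named json, which is what the DoFn above expects:

SELECT TO_JSON_STRING(t) AS json
FROM `your-project.your_dataset.your_table` AS t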

I know this might look rather daunting to some (it kinda does to me, I admit), but it will get you the throughput that you require!

You can ignore this part:

      Map<String, String> attributes = new HashMap<>();
      attributes.put("sourceSystem", sourceSystem.get());
      attributes.put("targetDataset", targetDataset.get());
      attributes.put("targetTable", targetTable.get());

Those are just some extra attributes we add to the Pub/Sub message purely for our own use.

Use Pub/Sub batch messages. This allows your code to batch multiple messages into a single call to the Pub/Sub service.

Example code from Google (link):

from concurrent import futures
from google.cloud import pubsub_v1

# TODO(developer)
# project_id = "your-project-id"
# topic_id = "your-topic-id"

# Configure the batch to publish as soon as there are 10 messages
# or 1 KiB of data, or 1 second has passed.
batch_settings = pubsub_v1.types.BatchSettings(
    max_messages=10,  # default 100
    max_bytes=1024,  # default 1 MB
    max_latency=1,  # default 10 ms
)
publisher = pubsub_v1.PublisherClient(batch_settings)
topic_path = publisher.topic_path(project_id, topic_id)
publish_futures = []

# Resolve the publish future in a separate thread.
def callback(future: pubsub_v1.publisher.futures.Future) -> None:
    message_id = future.result()
    print(message_id)

for n in range(1, 10):
    data_str = f"Message number {n}"
    # Data must be a bytestring
    data = data_str.encode("utf-8")
    publish_future = publisher.publish(topic_path, data)
    # Non-blocking. Allow the publisher client to batch multiple messages.
    publish_future.add_done_callback(callback)
    publish_futures.append(publish_future)

futures.wait(publish_futures, return_when=futures.ALL_COMPLETED)

print(f"Published messages with batch settings to {topic_path}.")
