How to use the BigQuery Storage API to concurrently read streams in Python threads
I have a large table (external to BigQuery, as the data sits in Google Cloud Storage). I want to scan the table using BigQuery to a client machine. For throughput, I fetch multiple streams concurrently in multiple threads.

From all I can tell, concurrency is not working. There's actually some penalty when using multiple threads.
import concurrent.futures
import logging
import queue
import threading
import time

from google.cloud.bigquery_storage import types
from google.cloud import bigquery_storage

PROJECT_ID = 'abc'
CREDENTIALS = {....}


def main():
    table = "projects/{}/datasets/{}/tables/{}".format(PROJECT_ID, 'db', 'tb')

    requested_session = types.ReadSession()
    requested_session.table = table
    requested_session.data_format = types.DataFormat.AVRO
    requested_session.read_options.selected_fields = ["a", "b"]

    client = bigquery_storage.BigQueryReadClient(credentials=CREDENTIALS)
    session = client.create_read_session(
        parent="projects/{}".format(PROJECT_ID),
        read_session=requested_session,
        max_stream_count=0,
    )
    if not session.streams:
        return

    n_streams = len(session.streams)
    print("Total streams", n_streams)  # this prints 1000

    q_out = queue.Queue(1024)
    concurrency = 4
    with concurrent.futures.ThreadPoolExecutor(concurrency) as pool:
        tasks = [
            pool.submit(download_row,
                        client._transport.__class__,
                        client._transport._grpc_channel,
                        s.name,
                        q_out)
            for s in session.streams
        ]

        t0 = time.perf_counter()
        ntotal = 0
        ndone = 0
        while True:
            page = q_out.get()
            if page is None:
                ndone += 1
                if ndone == len(tasks):
                    break
            else:
                for row in page:
                    ntotal += 1
                    if ntotal % 10000 == 0:
                        qps = int(ntotal / (time.perf_counter() - t0))
                        print(f'QPS so far: {qps}')

        for t in tasks:
            t.result()


def download_row(transport_cls, channel, stream_name, q_out):
    try:
        transport = transport_cls(channel=channel)
        client = bigquery_storage.BigQueryReadClient(
            transport=transport,
        )
        reader = client.read_rows(stream_name)
        for page in reader.rows().pages:
            q_out.put(page)
    finally:
        q_out.put(None)


if __name__ == '__main__':
    main()
The Google BigQuery Storage API docs and multiple sources claim that one can fetch multiple "streams" concurrently for higher throughput, yet I didn't find any functional example. I've followed the advice to share a gRPC "channel" across the threads.
The data items are large. The QPS I got is roughly:

150, concurrency=1
120, concurrency=2
140, concurrency=4

Each "page" contains about 200 rows.
Thoughts:

BigQuery quota? I only saw a request rate limit, and no limit on the volume of data traffic per second. The quotas do not appear to be limiting in my case.

BigQuery server-side options? Doesn't seem to be relevant. BigQuery should accept concurrent requests with enough capacity.

gRPC usage? I think this is the main direction for digging, but I don't know what's wrong in my code.

Can anyone shed some light on this? Thanks.
Python threads do not run in parallel because of the GIL. You are creating threads, not multiprocesses, and because of the GIL a CPython process executes Python bytecode on only one core at a time.

ThreadPoolExecutor has been available since Python 3.2, but it is not widely used, perhaps because of misunderstandings about the capabilities and limitations of threads in Python: parallel execution of Python code is prevented by the Global Interpreter Lock ("GIL").
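The effect is easy to see without BigQuery at all. Here is a minimal, self-contained demonstration (the `busy`/`timed` helpers are illustrative names, not from the question): pure-Python CPU work does not speed up across threads, while it can across processes.

```python
import concurrent.futures
import time

def busy(n):
    # Pure-Python CPU work; holds the GIL the whole time it runs.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, workers, n):
    # Run `workers` copies of busy() in the given executor and time them.
    with executor_cls(workers) as pool:
        t0 = time.perf_counter()
        list(pool.map(busy, [n] * workers))
        return time.perf_counter() - t0

if __name__ == '__main__':
    n = 2_000_000
    t_threads = timed(concurrent.futures.ThreadPoolExecutor, 4, n)
    t_procs = timed(concurrent.futures.ProcessPoolExecutor, 4, n)
    # On a multi-core machine, expect the process pool to finish faster.
    print(f'threads: {t_threads:.2f}s, processes: {t_procs:.2f}s')
```

Note this applies to CPU-bound work such as decoding pages into Python rows; threads blocked on network I/O do release the GIL.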
Look into using the multiprocessing module; a good read is here.
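Applied to the question's setup, a hedged sketch of the multiprocessing approach: each worker process builds its own client (and hence its own gRPC channel), so neither the GIL nor a shared channel is contended. `chunk`, `read_worker`, and `parallel_read` are illustrative names, and the worker assumes a read session was already created as in the question, with default credentials available in each process.

```python
import concurrent.futures

def chunk(items, n):
    # Round-robin partition of `items` into n groups (pure helper).
    groups = [[] for _ in range(n)]
    for i, item in enumerate(items):
        groups[i % n].append(item)
    return groups

def read_worker(stream_names):
    # Runs in a separate process: builds its own client and channel.
    from google.cloud import bigquery_storage
    client = bigquery_storage.BigQueryReadClient()
    nrows = 0
    for name in stream_names:
        reader = client.read_rows(name)
        for page in reader.rows().pages:
            nrows += page.num_items
    return nrows

def parallel_read(stream_names, workers=4):
    # Fan the session's streams out across processes; sum the row counts.
    with concurrent.futures.ProcessPoolExecutor(workers) as pool:
        return sum(pool.map(read_worker, chunk(stream_names, workers)))
```

You would call `parallel_read([s.name for s in session.streams])` in place of the thread pool in the question's `main()`.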
UPDATE:

Also, in your code you need one more param: requested_streams
n_streams = 2
session = client.create_read_session(
    table_ref,
    parent,
    requested_streams=n_streams,
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    sharding_strategy=(bigquery_storage_v1beta1.enums.ShardingStrategy.BALANCED),
)
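The snippet above uses the older v1beta1 signature; with the v1 client imported in the question, the corresponding knob is `max_stream_count` on `create_read_session` (the question passed 0, which lets the server choose and yielded 1000 streams). A hedged sketch of the v1 equivalent, where `read_session_args` is an illustrative helper, not a library API:

```python
from typing import Any, Dict

def read_session_args(project_id: str, table: str, n_streams: int) -> Dict[str, Any]:
    # Keyword arguments for BigQueryReadClient.create_read_session in the
    # v1 client; requested_streams (v1beta1) corresponds to max_stream_count.
    return {
        "parent": f"projects/{project_id}",
        "read_session": {
            "table": table,
            "data_format": "AVRO",  # matches the question's session
        },
        "max_stream_count": n_streams,  # 0 = let the server choose
    }

def create_session(client, project_id, table, n_streams=2):
    # client is assumed to be a google.cloud.bigquery_storage.BigQueryReadClient
    return client.create_read_session(**read_session_args(project_id, table, n_streams))
```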