
How to use BigQuery Storage API to concurrently read streams in Python threads

I have a large table (external to BigQuery, as the data sits in Google Cloud Storage). I want to scan the table with BigQuery and pull it down to a client machine. For throughput, I fetch multiple streams concurrently in multiple threads.

From all I can tell, concurrency is not working; there is actually a penalty when using multiple threads.

    import concurrent.futures
    import logging
    import queue
    import threading
    import time
    
    from google.cloud.bigquery_storage import types
    from google.cloud import bigquery_storage
    
    PROJECT_ID = 'abc'
    CREDENTIALS = {....}
    
    
    def main():
        table = "projects/{}/datasets/{}/tables/{}".format(PROJECT_ID, 'db', 'tb')
    
        requested_session = types.ReadSession()
        requested_session.table = table
        requested_session.data_format = types.DataFormat.AVRO
        requested_session.read_options.selected_fields = ["a", "b"]
    
        client = bigquery_storage.BigQueryReadClient(credentials=CREDENTIALS)
        session = client.create_read_session(
            parent="projects/{}".format(PROJECT_ID),
            read_session=requested_session,
            max_stream_count=0,  # 0 lets the server decide how many streams to create
        )
    
        if not session.streams:
            return
    
        n_streams = len(session.streams)
        print("Total streams", n_streams)  # this prints 1000
    
        q_out = queue.Queue(1024)
        concurrency = 4
    
        with concurrent.futures.ThreadPoolExecutor(concurrency) as pool:
            tasks = [
                pool.submit(download_row,
                            client._transport.__class__,
                            client._transport._grpc_channel,
                            s.name,
                            q_out)
                for s in session.streams
            ]
    
            t0 = time.perf_counter()
            ntotal = 0
            ndone = 0
            while True:
                page = q_out.get()
                if page is None:
                    ndone += 1
                    if ndone == len(tasks):
                        break
                else:
                    for row in page:
                        ntotal += 1
                        if ntotal % 10000 == 0:
                            qps = int(ntotal / (time.perf_counter() - t0))
                            print(f'QPS so far:  {qps}')
    
            for t in tasks:
                t.result()
    
    
    def download_row(transport_cls, channel, stream_name, q_out):
        try:
            transport = transport_cls(channel=channel)
            client = bigquery_storage.BigQueryReadClient(
                transport=transport,
                )
            reader = client.read_rows(stream_name)
            for page in reader.rows().pages:
                q_out.put(page)
        finally:
            q_out.put(None)
    
    
    if __name__ == '__main__':
        main()

The Google BigQuery Storage API docs and multiple sources claim that one can fetch multiple "streams" concurrently for higher throughput, yet I couldn't find any working example. I've followed the advice to share a gRPC channel across the threads.

The data items are large. The QPS I got is roughly:

150, concurrency=1
120, concurrency=2
140, concurrency=4

Each "page" contains about 200 rows.每个“页面”包含大约 200 行。

Thoughts:

  1. BigQuery quota? I only saw a request rate limit and did not see a limit on the volume of data traffic per second. The quotas do not appear to be a limiting factor in my case.

  2. BigQuery server-side options? These don't seem relevant; BigQuery should be able to accept concurrent requests with plenty of capacity.

  3. gRPC usage? I think this is the main direction to dig into, but I don't know what's wrong in my code.

Can anyone shed some light on this? Thanks.

Python threads do not run in parallel because of the GIL.

You are creating threads, not separate processes, and because of the GIL only one thread executes Python bytecode at a time.

ThreadPoolExecutor has been available since Python 3.2, but it is not widely used, perhaps because of misunderstandings about the capabilities and limitations of threads in Python; those limitations are enforced by the Global Interpreter Lock ("GIL").

Look into using the multiprocessing module; a good read is here.
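
As a concrete illustration of that suggestion (not from the original answer), here is a minimal multiprocessing sketch. It assumes the CREDENTIALS object and the read session from the question above are available, and that each worker process can build its own BigQueryReadClient; streams are fanned out across processes instead of threads, so each worker has its own interpreter and its own gRPC channel.

    import multiprocessing as mp

    from google.cloud import bigquery_storage


    def read_stream(stream_name):
        # Each worker process builds its own client (and therefore its own
        # gRPC channel), so nothing is shared across processes.
        client = bigquery_storage.BigQueryReadClient(credentials=CREDENTIALS)
        reader = client.read_rows(stream_name)
        count = 0
        for page in reader.rows().pages:
            count += sum(1 for _ in page)
        return count


    def main_mp(stream_names, concurrency=4):
        # Fan the stream names out across worker processes instead of threads.
        with mp.Pool(processes=concurrency) as pool:
            counts = pool.map(read_stream, stream_names)
        print("total rows", sum(counts))

The stream names would come from the session created in the question, e.g. [s.name for s in session.streams].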

UPDATE:

Also, in your code you need one more parameter: requested_streams

    n_streams = 2
    session = client.create_read_session(
        table_ref,
        parent,
        requested_streams=n_streams,
        format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
        sharding_strategy=(bigquery_storage_v1beta1.enums.ShardingStrategy.BALANCED),
    )
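
Note that this snippet uses the older bigquery_storage_v1beta1 interface. In the v1 client used in the question, the rough equivalent appears to be the max_stream_count argument, for example:

    session = client.create_read_session(
        parent="projects/{}".format(PROJECT_ID),
        read_session=requested_session,
        max_stream_count=2,  # cap the number of streams the server creates
    )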
