How to read streams concurrently in Python threads using the BigQuery Storage API

Question

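I have a large table (external to BigQuery, since the data lives in Google Cloud Storage). I want to scan the table to a client machine using BigQuery. For throughput, I fetch multiple streams concurrently in multiple threads.

As far as I can tell, the concurrency is not working. Using multiple threads actually incurs a small penalty.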

    import concurrent.futures
    import logging
    import queue
    import threading
    import time
    
    from google.cloud.bigquery_storage import types
    from google.cloud import bigquery_storage
    
    PROJECT_ID = 'abc'
    CREDENTIALS = {....}
    
    
    def main():
        table = "projects/{}/datasets/{}/tables/{}".format(PROJECT_ID, 'db', 'tb')
    
        requested_session = types.ReadSession()
        requested_session.table = table
        requested_session.data_format = types.DataFormat.AVRO
        requested_session.read_options.selected_fields = ["a", "b"]
    
        client = bigquery_storage.BigQueryReadClient(credentials=CREDENTIALS)
        session = client.create_read_session(
            parent="projects/{}".format(PROJECT_ID),
            read_session=requested_session,
            max_stream_count=0,
        )
    
        if not session.streams:
            return
    
        n_streams = len(session.streams)
        print("Total streams", n_streams)  # this prints 1000
    
        q_out = queue.Queue(1024)
        concurrency = 4
    
        with concurrent.futures.ThreadPoolExecutor(concurrency) as pool:
            tasks = [
                pool.submit(download_row,
                            client._transport.__class__,
                            client._transport._grpc_channel,
                            s.name,
                            q_out)
                for s in session.streams
            ]
    
            t0 = time.perf_counter()
            ntotal = 0
            ndone = 0
            while True:
                page = q_out.get()
                if page is None:
                    ndone += 1
                    if ndone == len(tasks):
                        break
                else:
                    for row in page:
                        ntotal += 1
                        if ntotal % 10000 == 0:
                            qps = int(ntotal / (time.perf_counter() - t0))
                            print(f'QPS so far:  {qps}')
    
            for t in tasks:
                t.result()
    
    
    def download_row(transport_cls, channel, stream_name, q_out):
        try:
            transport = transport_cls(channel=channel)
            client = bigquery_storage.BigQueryReadClient(
                transport=transport,
                )
            reader = client.read_rows(stream_name)
            for page in reader.rows().pages:
                q_out.put(page)
        finally:
            q_out.put(None)
    
    
    if __name__ == '__main__':
        main()

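The Google BigQuery Storage API documentation and several other sources claim that multiple "streams" can be fetched concurrently for higher throughput, but I have not found any working example. I have followed the advice to share a gRPC "channel" between the threads.

The data items are large. The QPS (rows per second) I get is roughly: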

150, concurrency=1
120, concurrency=2
140, concurrency=4

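Each "page" contains about 200 rows.

Thoughts:

1. BigQuery quota? I only saw request rate limits, and no limit on data volume per second. The quotas do not appear to be what is limiting my case.
2. BigQuery server-side options? This seems unrelated. BigQuery should accept concurrent requests as long as it has enough capacity.
3. gRPC usage? I think this is the main direction to dig into, but I don't know what is wrong with my code.

Can anyone shed some light on this? Thanks.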

Answer 1

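Because of the GIL, Python threads do not run in parallel.

You are creating threads, not multiple processes. Under the GIL, only one thread executes Python bytecode at a time, so CPU-bound Python code effectively runs on a single core no matter how many threads you start.

ThreadPoolExecutor has been available since Python 3.2, but it is not widely used, probably because the capabilities and limitations of threads in Python are often misunderstood. Those limitations are enforced by the Global Interpreter Lock ("GIL").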
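
To see the effect described above in isolation, here is a minimal sketch that runs the same CPU-bound function serially and in a thread pool. The names busy, run_serial, run_threads and timed are hypothetical helpers, not part of any library, and the workload is a stand-in for row decoding rather than the code from the question. On a standard CPython build with the GIL enabled, the threaded run is typically not faster than the serial one, but the actual timings depend on your machine.

```python
import concurrent.futures
import time


def busy(n):
    """CPU-bound stand-in for per-row decoding work (pure Python loop)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_serial(jobs):
    """Run every job one after another in the calling thread."""
    return [busy(n) for n in jobs]


def run_threads(jobs, workers=4):
    """Run the same jobs in a thread pool; results keep the input order."""
    with concurrent.futures.ThreadPoolExecutor(workers) as pool:
        return list(pool.map(busy, jobs))


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call of fn(*args)."""
    t0 = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - t0


if __name__ == "__main__":
    jobs = [2_000_000] * 4
    _, t_serial = timed(run_serial, jobs)
    _, t_threads = timed(run_threads, jobs)
    print(f"serial:  {t_serial:.2f}s")
    print(f"threads: {t_threads:.2f}s")
```

Threads still help when the work is mostly waiting on the network, because blocking I/O releases the GIL. The comparison above only covers the CPU-bound part.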

Take a look at the multiprocessing module instead.
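
A rough sketch of what that could look like for the read session in the question, using concurrent.futures.ProcessPoolExecutor (which is built on multiprocessing). Several things here are assumptions rather than part of the original post: the helper names split_round_robin, count_rows_in_streams and count_rows_parallel are hypothetical, each worker builds its own BigQueryReadClient from default credentials instead of the CREDENTIALS object, and decoding AVRO rows is assumed to have its usual dependency (fastavro) installed. Only the stream names, which are plain strings, are passed between processes.

```python
import concurrent.futures


def split_round_robin(items, n_groups):
    """Deal items into up to n_groups lists, round-robin, dropping empty ones."""
    groups = [items[i::n_groups] for i in range(n_groups)]
    return [g for g in groups if g]


def count_rows_in_streams(stream_names):
    """Runs inside a worker process: read every stream and count its rows."""
    # Imported here so each process creates its own client and gRPC channel.
    # Assumption: application default credentials are available.
    from google.cloud import bigquery_storage

    client = bigquery_storage.BigQueryReadClient()
    total = 0
    for name in stream_names:
        reader = client.read_rows(name)
        for page in reader.rows().pages:
            for _row in page:  # iterating the page decodes the rows
                total += 1
    return total


def count_rows_parallel(stream_names, processes=4):
    """Spread the streams over worker processes and sum the row counts."""
    groups = split_round_robin(list(stream_names), processes)
    with concurrent.futures.ProcessPoolExecutor(processes) as pool:
        return sum(pool.map(count_rows_in_streams, groups))
```

It would be called from main() as count_rows_parallel([s.name for s in session.streams]), with the call kept under the if __name__ == '__main__' guard so that it also works with the "spawn" start method. Whether this actually raises throughput depends on where the bottleneck is, so it is worth measuring rather than assuming.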

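Update:

Your code also needs one more parameter, requested_streams. The snippet below uses the older v1beta1 client; in the v1 client used in the question, the corresponding argument of create_read_session is max_stream_count.

    from google.cloud import bigquery_storage_v1beta1

    # client is a bigquery_storage_v1beta1.BigQueryStorageClient;
    # table_ref and parent are assumed to be defined elsewhere.
    n_streams = 2
    session = client.create_read_session(
        table_ref,
        parent,
        requested_streams=n_streams,
        format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
        sharding_strategy=(bigquery_storage_v1beta1.enums.ShardingStrategy.BALANCED),
    )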
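
For the v1 client used in the question, the same idea can be sketched by capping the session with max_stream_count instead of passing 0, which leaves the number of streams up to the service. The helper make_read_session_request is hypothetical; it only builds the request as a plain dict, which the v1 create_read_session call accepts through its request argument. The project, dataset, table and field names are the placeholders from the question.

```python
def make_read_session_request(project_id, dataset, table, fields,
                              data_format, max_stream_count):
    """Build a request dict for BigQueryReadClient.create_read_session (v1)."""
    return {
        "parent": f"projects/{project_id}",
        "read_session": {
            "table": f"projects/{project_id}/datasets/{dataset}/tables/{table}",
            "data_format": data_format,
            "read_options": {"selected_fields": list(fields)},
        },
        # 0 lets the service pick the number of streams;
        # a positive value is an upper bound on how many it returns.
        "max_stream_count": max_stream_count,
    }


# Usage sketch (needs google-cloud-bigquery-storage and credentials):
#
#   from google.cloud import bigquery_storage
#   from google.cloud.bigquery_storage import types
#
#   client = bigquery_storage.BigQueryReadClient()
#   request = make_read_session_request(
#       "abc", "db", "tb", ["a", "b"], types.DataFormat.AVRO, 4)
#   session = client.create_read_session(request=request)
#   print(len(session.streams))  # at most 4
```

Matching the cap to the number of workers gives each worker a small number of long streams instead of many short ones. The service may still return fewer streams than requested.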
