How to use the BigQuery Storage API to concurrently read streams in Python threads
I have a large table (external to BigQuery, as the data sits in Google Cloud Storage). I want to scan the table using BigQuery to a client machine. For throughput, I fetch multiple streams concurrently in multiple threads.

From all I can tell, concurrency is not working. There's actually some penalty when using multiple threads.
import concurrent.futures
import logging
import queue
import threading
import time

from google.cloud.bigquery_storage import types
from google.cloud import bigquery_storage

PROJECT_ID = 'abc'
CREDENTIALS = {....}


def main():
    table = "projects/{}/datasets/{}/tables/{}".format(PROJECT_ID, 'db', 'tb')

    requested_session = types.ReadSession()
    requested_session.table = table
    requested_session.data_format = types.DataFormat.AVRO
    requested_session.read_options.selected_fields = ["a", "b"]

    client = bigquery_storage.BigQueryReadClient(credentials=CREDENTIALS)
    session = client.create_read_session(
        parent="projects/{}".format(PROJECT_ID),
        read_session=requested_session,
        max_stream_count=0,
    )
    if not session.streams:
        return

    n_streams = len(session.streams)
    print("Total streams", n_streams)  # this prints 1000

    q_out = queue.Queue(1024)
    concurrency = 4
    with concurrent.futures.ThreadPoolExecutor(concurrency) as pool:
        tasks = [
            pool.submit(download_row,
                        client._transport.__class__,
                        client._transport._grpc_channel,
                        s.name,
                        q_out)
            for s in session.streams
        ]

        t0 = time.perf_counter()
        ntotal = 0
        ndone = 0
        while True:
            page = q_out.get()
            if page is None:
                ndone += 1
                if ndone == len(tasks):
                    break
            else:
                for row in page:
                    ntotal += 1
                    if ntotal % 10000 == 0:
                        qps = int(ntotal / (time.perf_counter() - t0))
                        print(f'QPS so far: {qps}')

        for t in tasks:
            t.result()


def download_row(transport_cls, channel, stream_name, q_out):
    try:
        transport = transport_cls(channel=channel)
        client = bigquery_storage.BigQueryReadClient(
            transport=transport,
        )
        reader = client.read_rows(stream_name)
        for page in reader.rows().pages:
            q_out.put(page)
    finally:
        q_out.put(None)


if __name__ == '__main__':
    main()
The Google BigQuery Storage API docs and multiple sources claim that one can fetch multiple "streams" concurrently for higher throughput, yet I didn't find any functional example. I've followed the advice to share a gRPC "channel" across the threads.
The data items are large. The QPS I got is roughly:

150, concurrency=1
120, concurrency=2
140, concurrency=4

Each "page" contains about 200 rows.
Thoughts:

BigQuery quota? I only saw a request rate limit, and no limit on the volume of data traffic per second. The quotas do not appear to be limiting in my case.

BigQuery server-side options? Doesn't seem to be relevant. BigQuery should accept concurrent requests with enough capacity.

gRPC usage? I think this is the main direction for digging, but I don't know what's wrong in my code.

Can anyone shed some light on this? Thanks.
Python threads do not run in parallel because of the GIL. You are creating threads, not multiprocesses, and because of the GIL a CPython process executes Python bytecode on only one core at a time.

ThreadPoolExecutor has been available since Python 3.2, but it is not widely used, perhaps because of misunderstandings about the capabilities and limitations of threads in Python: parallel execution of Python code is prevented by the Global Interpreter Lock ("GIL").
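The effect is easy to see without BigQuery at all. Here is a minimal, self-contained demonstration (the `busy`/`timed` helpers are illustrative names, not from the question): pure-Python CPU work does not speed up across threads, while it can across processes.

```python
import concurrent.futures
import time

def busy(n):
    # Pure-Python CPU work; holds the GIL the whole time it runs.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, workers, n):
    # Run `workers` copies of busy() in the given executor and time them.
    with executor_cls(workers) as pool:
        t0 = time.perf_counter()
        list(pool.map(busy, [n] * workers))
        return time.perf_counter() - t0

if __name__ == '__main__':
    n = 2_000_000
    t_threads = timed(concurrent.futures.ThreadPoolExecutor, 4, n)
    t_procs = timed(concurrent.futures.ProcessPoolExecutor, 4, n)
    # On a multi-core machine, expect the process pool to finish faster.
    print(f'threads: {t_threads:.2f}s, processes: {t_procs:.2f}s')
```

Note this applies to CPU-bound work such as decoding pages into Python rows; threads blocked on network I/O do release the GIL.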
Look into using the multiprocessing module; a good read is here.
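Applied to the question's setup, a hedged sketch of the multiprocessing approach: each worker process builds its own client (and hence its own gRPC channel), so neither the GIL nor a shared channel is contended. `chunk`, `read_worker`, and `parallel_read` are illustrative names, and the worker assumes a read session was already created as in the question, with default credentials available in each process.

```python
import concurrent.futures

def chunk(items, n):
    # Round-robin partition of `items` into n groups (pure helper).
    groups = [[] for _ in range(n)]
    for i, item in enumerate(items):
        groups[i % n].append(item)
    return groups

def read_worker(stream_names):
    # Runs in a separate process: builds its own client and channel.
    from google.cloud import bigquery_storage
    client = bigquery_storage.BigQueryReadClient()
    nrows = 0
    for name in stream_names:
        reader = client.read_rows(name)
        for page in reader.rows().pages:
            nrows += page.num_items
    return nrows

def parallel_read(stream_names, workers=4):
    # Fan the session's streams out across processes; sum the row counts.
    with concurrent.futures.ProcessPoolExecutor(workers) as pool:
        return sum(pool.map(read_worker, chunk(stream_names, workers)))
```

You would call `parallel_read([s.name for s in session.streams])` in place of the thread pool in the question's `main()`.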
UPDATE:

Also, in your code you need one more param: requested_streams
n_streams = 2
session = client.create_read_session(
    table_ref,
    parent,
    requested_streams=n_streams,
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    sharding_strategy=(bigquery_storage_v1beta1.enums.ShardingStrategy.BALANCED),
)
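The snippet above uses the older v1beta1 signature; with the v1 client imported in the question, the corresponding knob is `max_stream_count` on `create_read_session` (the question passed 0, which lets the server choose and yielded 1000 streams). A hedged sketch of the v1 equivalent, where `read_session_args` is an illustrative helper, not a library API:

```python
from typing import Any, Dict

def read_session_args(project_id: str, table: str, n_streams: int) -> Dict[str, Any]:
    # Keyword arguments for BigQueryReadClient.create_read_session in the
    # v1 client; requested_streams (v1beta1) corresponds to max_stream_count.
    return {
        "parent": f"projects/{project_id}",
        "read_session": {
            "table": table,
            "data_format": "AVRO",  # matches the question's session
        },
        "max_stream_count": n_streams,  # 0 = let the server choose
    }

def create_session(client, project_id, table, n_streams=2):
    # client is assumed to be a google.cloud.bigquery_storage.BigQueryReadClient
    return client.create_read_session(**read_session_args(project_id, table, n_streams))
```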