

BigQuery streaming 'insertAll' performance with PHP

We're streaming a high volume of data server-side into BigQuery using the google-api-php-client library. The streaming works fine apart from the performance.

Our load testing is giving us an average time of 1000ms (1 sec) to stream one row into BigQuery. We can't have the client waiting for more than 200ms. We've tested with smaller payloads and the time remains the same. Async calls on the client side are not an option for us.

The 'bottleneck' line of code is:

$service->tabledata->insertAll(PROJECT_NUMBER, DATA_SET, TABLE, $request);
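For reference, $request is built roughly as follows before the call above (a minimal sketch assuming the v1 google-api-php-client classes; the row contents and insert ID are illustrative):

$row = new Google_Service_Bigquery_TableDataInsertAllRequestRows();
$row->setInsertId(uniqid('', true)); // lets BigQuery deduplicate retried inserts
$row->setJson(array('field1' => 'value1', 'field2' => 123));

$request = new Google_Service_Bigquery_TableDataInsertAllRequest();
$request->setRows(array($row));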

Having looked under the hood of the library, we can see that the call to insert the row is simply a cURL request (Curl.php in the library).

Is there any way to modify insertAll() to make it faster? We don't care about the result, so a fire-and-forget approach would work for us. We've tried setting CURLOPT_CONNECTTIMEOUT_MS and CURLOPT_TIMEOUT_MS on the underlying cURL request, but it did not help.
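The timeout options we tried look roughly like this, applied to the handle the library creates in Curl.php (the millisecond values here are illustrative):

curl_setopt($ch, CURLOPT_CONNECTTIMEOUT_MS, 100); // connection timeout in milliseconds
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 200);        // total request timeout in milliseconds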

Having read all your comments and side notes: the approach you've chosen does not scale and won't scale. You need to rethink it around asynchronous processing.

Processing IO-bound or CPU-bound tasks in the background is now common practice in most web applications. There is plenty of software to help build background jobs, some of it based on a messaging system like Beanstalkd.

Basically, you need to distribute the insert jobs across a closed network, prioritize them, and consume (run) them. Well, that's exactly what Beanstalkd provides.

Beanstalkd lets you organize jobs into tubes, each tube corresponding to a job type.

You need an API/producer that can put jobs on a tube, say a JSON representation of the row. This was a killer feature for our use case. So we have an API that receives the rows and places them on a tube; this takes just a few milliseconds, so you can achieve a fast response time.
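As an illustration, the producer side can be as small as this (a sketch assuming the pda/pheanstalk client with a v3-style API; the tube name 'bigquery-rows' and the $rowData variable are placeholders):

use Pheanstalk\Pheanstalk;

require 'vendor/autoload.php';

// Producer: the web request only serializes the row and queues it,
// then returns to the client within a few milliseconds.
$pheanstalk = new Pheanstalk('127.0.0.1');

$job = json_encode(array(
    'insertId' => uniqid('', true), // lets the consumer deduplicate on retry
    'json'     => $rowData,         // the row to stream into BigQuery
));

$pheanstalk->useTube('bigquery-rows')->put($job);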

On the other side, you now have a bunch of jobs sitting on some tubes. You need an agent. An agent/consumer can reserve a job.

It also helps with job management and retries: when a job is successfully processed, the consumer can delete it from the tube. In case of failure, the consumer can bury the job; it will not be pushed back onto the tube, but will remain available for further inspection.

A consumer can also release a job; Beanstalkd will push it back onto the tube and make it available to another client.
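Putting those pieces together, a worker might look like this (a sketch assuming the pda/pheanstalk client with a v3-style API and the same v1 BigQuery classes; PROJECT_NUMBER, DATA_SET and TABLE are the constants from the question, and the tube name matches the producer sketch above):

use Pheanstalk\Pheanstalk;

require 'vendor/autoload.php';

// $service is an authenticated Google_Service_Bigquery instance, set up as in the question.
$pheanstalk = new Pheanstalk('127.0.0.1');
$pheanstalk->watch('bigquery-rows')->ignore('default');

// Long-running worker: reserve a job, stream the row, then delete/bury/release it.
while (true) {
    $job = $pheanstalk->reserve(); // blocks until a job is available

    try {
        $payload = json_decode($job->getData(), true);

        $row = new Google_Service_Bigquery_TableDataInsertAllRequestRows();
        $row->setInsertId($payload['insertId']);
        $row->setJson($payload['json']);

        $request = new Google_Service_Bigquery_TableDataInsertAllRequest();
        $request->setRows(array($row));

        // The slow call now happens off the client request path.
        $service->tabledata->insertAll(PROJECT_NUMBER, DATA_SET, TABLE, $request);

        $pheanstalk->delete($job);         // success: remove the job from the tube
    } catch (Google_Service_Exception $e) {
        $pheanstalk->bury($job);           // API error: set the job aside for inspection
    } catch (Exception $e) {
        $pheanstalk->release($job, 0, 30); // transient failure: retry after 30 seconds
    }
}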

Beanstalkd clients are available for most common languages, and a web interface can be useful for debugging.
