BigQuery Java function timeout
I am running a series of calculations on ~120 tables.
Algorithm: read the final table, find the rows where day=something, and append the results to an existing table.
for (String table2 : tableArray) {
    // Rows in tablefinal for the given day that table2 does not contain yet
    String query = "SELECT * "
        + "FROM (SELECT * FROM tablefinal WHERE day=someday) AS tmp "
        + "WHERE NOT EXISTS ("
        + "SELECT * FROM " + table2 + " t2 WHERE t2.user_id=tmp.user_id AND t2.day=tmp.day);";
    // then append rows to table2...
    QueryJobConfiguration queryConfig =
        QueryJobConfiguration.newBuilder(query)
            .setDestinationTable(destinationTable)
            .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
            .build();
    // ....
    bigquery.query(queryConfig);
}
I do that for all 120 tables, always using tablefinal as the data source.
Issue: it takes too long, around 700 seconds, and after 540 seconds the Google Cloud Function dies. The results are correct, but it is too slow.
Question: how can I speed this up (the tables are pretty small, ~100K rows)? Can I send a bunch of queries in parallel?
You can run the requests concurrently in your code to parallelize them, but it's not the optimal way.
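For reference, in-code concurrency could look roughly like the sketch below. It is a minimal illustration using only the JDK: `ParallelTableJobs` and `runAll` are hypothetical names, and the `job` function stands in for whatever builds the query configuration for a table and calls `bigquery.query(queryConfig)`. The pool size caps how many queries are in flight at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper (not part of any library): runs one job per table
// on a fixed-size thread pool and returns the results in input order.
class ParallelTableJobs {

    static <T> List<T> runAll(List<String> tables, int parallelism, Function<String, T> job) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            // Submit every table up front; at most `parallelism` jobs run at once.
            List<Future<T>> futures = new ArrayList<>();
            for (String table : tables) {
                Callable<T> task = () -> job.apply(table);
                futures.add(pool.submit(task));
            }
            // Wait for each job in submission order; a failed job surfaces here.
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for jobs", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a table job failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In the question's setting, the function passed as `job` would run the append query for the given table (wrapping any checked exceptions), and `parallelism` would be chosen to stay within BigQuery's concurrency limits. The whole batch still has to finish within the function's timeout.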
Personally, I recommend publishing a Pub/Sub message for each query to run. So, here is the process: a first function publishes one message per table, and a second Cloud Function, triggered by each message, runs the query for that table.
The process is scalable (if tomorrow you have 250 tables, no worry, it scales!). Don't forget to set the retry parameter on the second Cloud Function, because parallel requests at a high rate to BigQuery can cause transient issues.
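The retry parameter above is a setting on the function itself. If you also want to guard the query call in code, a retry with exponential backoff is one option. The sketch below is only an illustration using the JDK: `Retry` and `withBackoff` are hypothetical names, and for simplicity it retries on any exception, whereas real code would retry only errors known to be transient.

```java
import java.util.concurrent.Callable;

// Hypothetical helper (not part of any library): retries a call that may
// fail transiently, doubling the wait between attempts.
class Retry {

    static <T> T withBackoff(Callable<T> call, int maxAttempts, long initialDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (InterruptedException e) {
                // Do not retry an interrupt; restore the flag and stop.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted during call", e);
            } catch (Exception e) {
                // Simplification: every other exception is treated as transient.
                if (attempt >= maxAttempts) {
                    throw new IllegalStateException("gave up after " + attempt + " attempts", e);
                }
                pause(delayMs);
                delayMs *= 2; // exponential backoff
            }
        }
    }

    private static void pause(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during backoff", e);
        }
    }
}
```

A call such as `Retry.withBackoff(() -> bigquery.query(queryConfig), 5, 1000L)` would then try the query up to five times, waiting 1 s, 2 s, 4 s and 8 s between attempts.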
Try Magnus - Workflow Automator. You can find it on GCP Marketplace as part of the Potens.io suite of tools for BigQuery.
In Magnus you can easily build a workflow with a Loop Task that takes a list of tables and, for each one, executes whatever query or statement you need.
You can set the Loop Task to run in Async Mode, so all tables are processed in parallel.
You can even control how many concurrent iterations you allow, so you respect the relevant API limits.
Check out https://potensio.zendesk.com for more details.
Disclosure: I am the author and lead of the Potens.io project by Viant Tech.