
BigQuery Java function timeout

I am running a series of calculations on ~120 tables.

Algorithm: read the final table, find the rows where day=something, and append the results to an existing table:

    for (TableId table2 : tableArray) {
        String query =
            "SELECT * " +
            "FROM (SELECT * FROM tablefinal WHERE day = someday) AS tmp " +
            "WHERE NOT EXISTS (" +
            "  SELECT * FROM " + table2.getTable() + " t2 " +
            "  WHERE t2.user_id = tmp.user_id AND t2.day = tmp.day)";

        // then append the matching rows to table2
        QueryJobConfiguration queryConfig =
            QueryJobConfiguration.newBuilder(query)
                .setDestinationTable(table2)
                .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
                .build();

        bigquery.query(queryConfig);
    }

Do that for all 120 tables, still using tablefinal as the data source.

Issue: it takes too long, around 700 seconds, and the Google Cloud Function is killed after 540 seconds. The results are correct, but it is too slow...

Question: how can I speed this up (the tables are pretty small, ~100K rows)? Can I send a bunch of queries in parallel?

You can run the requests concurrently in your code to parallelize them, but that is not the optimal approach.
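For the in-code option, here is a minimal sketch of fanning the queries out over a bounded thread pool with `ExecutorService`, assuming the google-cloud-bigquery client; the dataset name, table list, day value, and pool size are placeholders you would replace with your own:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelAppendQueries {

    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Hypothetical list of the ~120 destination tables.
        List<TableId> tables = List.of(
            TableId.of("my_dataset", "table_001"),
            TableId.of("my_dataset", "table_002"));

        // A bounded pool caps how many BigQuery jobs run at once,
        // which helps stay under API quotas.
        ExecutorService pool = Executors.newFixedThreadPool(10);
        List<Future<?>> futures = new ArrayList<>();
        for (TableId table : tables) {
            futures.add(pool.submit(() -> runAppendQuery(bigquery, table)));
        }
        for (Future<?> f : futures) {
            f.get(); // wait for completion and surface any failure
        }
        pool.shutdown();
    }

    static void runAppendQuery(BigQuery bigquery, TableId table) {
        String query =
            "SELECT * FROM (SELECT * FROM tablefinal WHERE day = someday) AS tmp " +
            "WHERE NOT EXISTS (SELECT * FROM " + table.getTable() +
            " t2 WHERE t2.user_id = tmp.user_id AND t2.day = tmp.day)";
        QueryJobConfiguration config = QueryJobConfiguration.newBuilder(query)
            .setDestinationTable(table)
            .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
            .build();
        try {
            bigquery.query(config);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Even parallelized, the whole batch must still finish within the function's timeout, which is why the Pub/Sub fan-out below is the more robust design.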

Personally, I recommend publishing a Pub/Sub message for each query to run. So, here is the process:

  • Cloud Scheduler triggers a first Cloud Function
  • The first Cloud Function iterates over the list of tables and publishes a message to Pub/Sub for each query to perform; the parameters to include in the message are up to you
  • A second Cloud Function, bound to the Pub/Sub topic, performs the query (you can also create a Pub/Sub push subscription and call an HTTP Cloud Function, or another HTTP target such as Cloud Run or App Engine, if you prefer)

The process is scalable (if tomorrow you have 250 tables, no worries, it scales!). Don't forget to set the retry parameter on the second Cloud Function, because parallel requests to BigQuery at a high rate can cause transient errors.
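The fan-out step of the first function could look roughly like this sketch, assuming the google-cloud-pubsub client; the project ID, topic name, and JSON payload shape are illustrative, not prescribed:

```java
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;
import java.util.List;

public class FanOutPublisher {

    public static void main(String[] args) throws Exception {
        // Hypothetical project and topic; replace with your own.
        TopicName topic = TopicName.of("my-project", "bq-append-queries");
        Publisher publisher = Publisher.newBuilder(topic).build();

        // The ~120 tables to process; one message per table.
        List<String> tables = List.of("table_001", "table_002");
        try {
            for (String table : tables) {
                // The payload carries whatever parameters the second
                // function needs to build its query.
                String payload = String.format(
                    "{\"table\": \"%s\", \"day\": \"2020-01-01\"}", table);
                PubsubMessage msg = PubsubMessage.newBuilder()
                    .setData(ByteString.copyFromUtf8(payload))
                    .build();
                publisher.publish(msg); // asynchronous, returns an ApiFuture
            }
        } finally {
            publisher.shutdown(); // flush pending messages before exiting
        }
    }
}
```

Each message then triggers one short-lived invocation of the second function, so no single invocation ever approaches the 540-second limit.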

Try Magnus - Workflow Automator - you can find it on the GCP Marketplace as part of the Potens.io suite of tools for BigQuery.

In Magnus you can easily build a workflow with a Loop Task that takes a list of tables and, for each one, executes whatever query/statement you need.


You can set the Loop Task to execute in Async Mode, so all tables will be processed in parallel.


As you can see, you can even control how many concurrent iterations are allowed, so you stay within the relevant API limits.

Check out https://potensio.zendesk.com for more details.

Disclosure: I am the author and lead of the Potens.io project by Viant Tech.
