
BigQuery ML Tensorflow model - UDF out of memory

I'm trying to run a Tensorflow model in BigQuery. The model is a variant of BERT that is small enough to fit within BigQuery's model size limit (<250MB).

I've tried to generate predictions with the model using the following query directly from the BigQuery console:

SELECT
  input_1,
  input_2,
  prediction
FROM
  ML.PREDICT(MODEL `MY_IMPORTED_MODEL`,
    (
    SELECT
      *
    FROM
      `MY_DATA_TABLE`
    ))

However, the query failed with the following error:

Resources exceeded during query execution: UDF out of memory.

I then attempted to generate predictions on a smaller sample of `MY_DATA_TABLE` with the following query:

SELECT
  input_1,
  input_2,
  prediction
FROM
  ML.PREDICT(MODEL `MY_IMPORTED_MODEL`,
    (
    SELECT
      *
    FROM
      `MY_DATA_TABLE`
    LIMIT 10
    ))

The smaller sample works perfectly fine.

I thought that maybe an OVER expression would fix the issue by forcing the usage of more slots, so I tried the following query (spoiler alert: it failed with the same out-of-memory error):

SELECT
  input_1,
  input_2,
  prediction
FROM
  ML.PREDICT(MODEL `MY_IMPORTED_MODEL`,
    (
    SELECT
      *,
      FLOOR(CAST(ROW_NUMBER() OVER (ORDER BY input_1) AS DECIMAL) / 10) AS batch_number
    FROM
      `MY_DATA_TABLE`
    ))

It seems that BigQuery attempts to feed too many rows to the model at once, producing batches large enough to trigger the out-of-memory error.

Since I cannot specify a BATCH_SIZE parameter when calling the ML.PREDICT function, I'd like to know if there is any other way of obtaining the predictions that wouldn't result in the out-of-memory errors.

There are 180 million rows to run the prediction for, so I'd like to do it from within BigQuery (not the GCP AI Platform).
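One workaround I've been sketching is to score the table in fixed-size chunks with BigQuery scripting, so each ML.PREDICT call only ever sees a bounded slice of rows. The destination table `MY_PREDICTIONS_TABLE` and the chunk count below are placeholders, not part of my actual setup:

```sql
-- Sketch only: chunked scoring via BigQuery scripting.
-- `MY_PREDICTIONS_TABLE` is a hypothetical destination table, and 180 chunks
-- (roughly 1M rows each) is an arbitrary choice to keep batches small.
DECLARE num_chunks INT64 DEFAULT 180;
DECLARE chunk INT64 DEFAULT 0;

WHILE chunk < num_chunks DO
  INSERT INTO `MY_PREDICTIONS_TABLE` (input_1, input_2, prediction)
  SELECT
    input_1,
    input_2,
    prediction
  FROM
    ML.PREDICT(MODEL `MY_IMPORTED_MODEL`,
      (
      SELECT
        *
      FROM
        `MY_DATA_TABLE`
      -- Deterministically assign each row to one of num_chunks buckets;
      -- the CAST assumes input_1 can serve as a hashing key.
      WHERE ABS(MOD(FARM_FINGERPRINT(CAST(input_1 AS STRING)), num_chunks)) = chunk
      ));
  SET chunk = chunk + 1;
END WHILE;
```

This trades one big query for many small ones, which costs more scheduling overhead but keeps each prediction batch bounded.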

Any ideas?

One option is to split your model into multiple models and call a separate ML.PREDICT for each of them.

The Keras functional API makes this quite easy, since you can compose different models and access them later with the get_layer method.
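As a rough sketch of that idea (the layer names, shapes, and sizes here are made up for illustration; this is not the asker's BERT variant), you can build the full model out of named sub-models and pull each one back out with get_layer to export it separately:

```python
import tensorflow as tf

# Toy stand-in for a larger model: two named sub-models composed with the
# Keras functional API. Names, shapes, and layer sizes are illustrative only.
inputs = tf.keras.Input(shape=(8,), name="features")
encoder = tf.keras.Sequential(
    [tf.keras.layers.Dense(4, activation="relu", name="dense_a")],
    name="encoder")
head = tf.keras.Sequential(
    [tf.keras.layers.Dense(1, name="dense_b")],
    name="head")
outputs = head(encoder(inputs))
full_model = tf.keras.Model(inputs, outputs, name="full_model")

# Each sub-model can be recovered by name and saved on its own as a
# SavedModel, so BigQuery can import each part under the size limit.
encoder_part = full_model.get_layer("encoder")
head_part = full_model.get_layer("head")
```

At prediction time you would then chain the ML.PREDICT calls, feeding the output columns of the first model's query into the second model's query.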
