
BIGQUERY - Query Exceeded resource limit

I am running the following query to join two tables and pick certain records based on fuzzy logic (Levenshtein distance):

WITH main_table AS (
  SELECT *
  FROM `project.data.Roof_Address`
), reference_table AS (
  SELECT *
  FROM `project.data.DATA_TREE_Address`
)
SELECT
  DR_NBR,
  ARRAY_AGG(
    STRUCT(n.LotSizeSqFt)
    ORDER BY EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname) LIMIT 1
  )[OFFSET(0)].*,
  ARRAY_AGG(
    EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname) LIMIT 1
  )[OFFSET(0)] AS distance_score
FROM main_table l
CROSS JOIN reference_table n
GROUP BY 1
HAVING ARRAY_AGG(
    EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname) LIMIT 1
  )[OFFSET(0)] < 10

This query returns the Project_Id (DR_NBR) from the first table and the project area (LotSizeSqFt) from the second table, filtered by Levenshtein score.

This query results in the following error:

[Image: screenshot of the error message]

Any suggestions on how to optimize the above query?

The distance function I am using is the one below:

#standardSQL
CREATE TEMPORARY FUNCTION EDIT_DISTANCE(string1 STRING, string2 STRING)
RETURNS INT64
LANGUAGE js AS """
  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };

  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;

      // two rows
      var prevRow  = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;

      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }

      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;

        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;

          // substitution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }

          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }

        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }

      return nextCol;
    }

  };

  var the_string1, the_string2;

  try {
    the_string1 = decodeURI(string1).toLowerCase();
  } catch (ex) {
    the_string1 = string1.toLowerCase();
  }

  try {
    the_string2 = decodeURI(string2).toLowerCase();
  } catch (ex) {
    the_string2 = string2.toLowerCase();
  }

  return Levenshtein.get(the_string1, the_string2);

""";

[Image: snapshot of the Roof_Address table]

[Image: snapshot of the DATA_TREE_Address table]

The main cost of the query is most likely the ORDER BY in:

ARRAY_AGG(
    STRUCT(n.LotSizeSqFt)
    ORDER BY EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname) LIMIT 1
  )[OFFSET(0)].*,

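I see that you return only one record from each ARRAY_AGG.

I suggest removing the ARRAY_AGG and taking the MAX or MIN of the EDIT_DISTANCE result instead. MAX or MIN is much cheaper than ordering all the records and then picking the first or last one.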
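As a rough sketch of that suggestion (not tested against the asker's tables), the query could compute each distance once in a subquery and then aggregate with MIN. It assumes the intent is to keep the closest match per DR_NBR, and that the EDIT_DISTANCE temporary function from the question is defined earlier in the same script. The `scored` CTE name is made up for illustration, and `ANY_VALUE(... HAVING MIN ...)` is used here as one way to fetch the LotSizeSqFt that belongs to the smallest distance.

```sql
-- Sketch only: assumes EDIT_DISTANCE (the temporary UDF from the question)
-- is declared above this statement.
WITH scored AS (
  -- Compute the distance once per pair instead of three times.
  SELECT
    l.DR_NBR,
    n.LotSizeSqFt,
    EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname) AS distance_score
  FROM `project.data.Roof_Address` AS l
  CROSS JOIN `project.data.DATA_TREE_Address` AS n
)
SELECT
  DR_NBR,
  -- LotSizeSqFt taken from a row with the smallest distance for this DR_NBR.
  ANY_VALUE(LotSizeSqFt HAVING MIN distance_score) AS LotSizeSqFt,
  MIN(distance_score) AS distance_score
FROM scored
GROUP BY DR_NBR
HAVING MIN(distance_score) < 10
```

Note that the CROSS JOIN still evaluates the JavaScript function for every pair of rows, so this removes the sorting cost but may not be enough on its own for very large tables. Also, the original distance_score and HAVING clause had no ORDER BY inside ARRAY_AGG and so picked an arbitrary distance; using MIN here is an assumption about what was intended.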

