BIGQUERY - Query Exceeded resource limit
I am running the following query to join two tables and pick certain records using fuzzy logic (Levenshtein distance):
WITH main_table as (
select *
from
`project.data.Roof_Address`
), reference_table as (
select *
from `project.data.DATA_TREE_Address`
)
select
DR_NBR,
ARRAY_AGG(
STRUCT(n.LotSizeSqFt)
ORDER BY EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname) LIMIT 1
)[OFFSET(0)].*,
ARRAY_AGG(
EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname) LIMIT 1
)[OFFSET(0)] distance_score
FROM main_table l
CROSS JOIN reference_table n
GROUP BY 1
having ARRAY_AGG(
EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname) LIMIT 1
)[OFFSET(0)] < 10
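This query returns the Project_Id (DR_NBR) from the first table and the project area (LotSizeSqFt) from the second table, based on a Levenshtein score filter.
The query fails with the error named in the title: query exceeded resource limit.
Any suggestions on how to optimize the query above?
The distance function I am using is the one below: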
#standardSQL
CREATE TEMPORARY FUNCTION EDIT_DISTANCE(string1 STRING, string2 STRING)
RETURNS INT64
LANGUAGE js AS """
var _extend = function(dst) {
var sources = Array.prototype.slice.call(arguments, 1);
for (var i=0; i<sources.length; ++i) {
var src = sources[i];
for (var p in src) {
if (src.hasOwnProperty(p)) dst[p] = src[p];
}
}
return dst;
};
var Levenshtein = {
/**
* Calculate levenshtein distance of the two strings.
*
* @param str1 String the first string.
* @param str2 String the second string.
* @return Integer the levenshtein distance (0 and above).
*/
get: function(str1, str2) {
// base cases
if (str1 === str2) return 0;
if (str1.length === 0) return str2.length;
if (str2.length === 0) return str1.length;
// two rows
var prevRow = new Array(str2.length + 1),
curCol, nextCol, i, j, tmp;
// initialise previous row
for (i=0; i<prevRow.length; ++i) {
prevRow[i] = i;
}
// calculate current row distance from previous row
for (i=0; i<str1.length; ++i) {
nextCol = i + 1;
for (j=0; j<str2.length; ++j) {
curCol = nextCol;
// substitution
nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
// insertion
tmp = curCol + 1;
if (nextCol > tmp) {
nextCol = tmp;
}
// deletion
tmp = prevRow[j + 1] + 1;
if (nextCol > tmp) {
nextCol = tmp;
}
// copy current col value into previous (in preparation for next iteration)
prevRow[j] = curCol;
}
// copy last col value into previous (in preparation for next iteration)
prevRow[j] = nextCol;
}
return nextCol;
}
};
var the_string1, the_string2;
try {
the_string1 = decodeURI(string1).toLowerCase();
} catch (ex) {
the_string1 = string1.toLowerCase();
}
try {
the_string2 = decodeURI(string2).toLowerCase();
} catch (ex) {
the_string2 = string2.toLowerCase();
}
return Levenshtein.get(the_string1, the_string2);
""";
Snapshot of the Roof_Address table
Snapshot of the DATA_TREE_Address table
The main cost of the query is most likely the ORDER BY in:
ARRAY_AGG(
STRUCT(n.LotSizeSqFt)
ORDER BY EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname) LIMIT 1
)[OFFSET(0)].*,
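I see that you return only one record from each ARRAY_AGG.
I suggest removing ARRAY_AGG and taking the MAX or MIN of the EDIT_DISTANCE result instead. MAX or MIN is much cheaper than ordering all the records and picking the first or last one.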
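As a rough sketch of that suggestion, the query could be rewritten along these lines. It assumes the EDIT_DISTANCE temporary function from the question is defined earlier in the same script, and that the intent is to keep the closest match per DR_NBR (the original HAVING clause has no ORDER BY, so it may have been checking an arbitrary distance). The `scored` CTE and the `distance` column are names introduced here for illustration. ANY_VALUE with HAVING MIN is used to carry LotSizeSqFt along with the minimum distance, since MIN alone returns only the score.

```sql
#standardSQL
-- Assumption: CREATE TEMPORARY FUNCTION EDIT_DISTANCE(...) from the question
-- appears above this statement in the same script.
WITH scored AS (
  SELECT
    l.DR_NBR,
    n.LotSizeSqFt,
    -- Compute the distance once per pair instead of once per aggregate.
    EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname) AS distance
  FROM `project.data.Roof_Address` AS l
  CROSS JOIN `project.data.DATA_TREE_Address` AS n
)
SELECT
  DR_NBR,
  -- LotSizeSqFt from a row that has the smallest distance in the group.
  ANY_VALUE(LotSizeSqFt HAVING MIN distance) AS LotSizeSqFt,
  -- Plain MIN replaces ARRAY_AGG(... ORDER BY ... LIMIT 1)[OFFSET(0)].
  MIN(distance) AS distance_score
FROM scored
GROUP BY DR_NBR
HAVING MIN(distance) < 10
```

This only removes the per-group sort. The CROSS JOIN still evaluates EDIT_DISTANCE for every pair of rows, so for large tables it may not be enough on its own to stay under the resource limit.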