BigQuery recursively join based on links between 2 ID columns

Given a table representing a many-to-many join between IDs like the following:

WITH t AS (
  SELECT 1 AS id_1, 'a' AS id_2,
  UNION ALL SELECT 2, 'a'
  UNION ALL SELECT 2, 'b'
  UNION ALL SELECT 3, 'b'
  UNION ALL SELECT 4, 'c'
  UNION ALL SELECT 5, 'c'
  UNION ALL SELECT 6, 'd'
  UNION ALL SELECT 6, 'e'
  UNION ALL SELECT 7, 'f'
)

SELECT * FROM t
| id_1 | id_2 |
|------|------|
| 1 | a |
| 2 | a |
| 2 | b |
| 3 | b |
| 4 | c |
| 5 | c |
| 6 | d |
| 6 | e |
| 7 | f |

I would like to be able to recursively join and then aggregate rows in order to find each disconnected sub-graph represented by these links, that is, each collection of IDs that are linked together:

[Figure: network graph]

The desired output for the example above would look something like this:

| id_1_coll | id_2_coll |
|-----------|-----------|
| 1, 2, 3 | a, b |
| 4, 5 | c |
| 6 | d, e |
| 7 | f |

where each row contains all the IDs one could reach by following the links in the table.

Note that 1 links to b even though there is no explicit link row, because we can follow the path 1 --> a --> 2 --> b using the links in the first 3 rows.
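To make the target concrete, here is a minimal sketch outside BigQuery, in Python, that computes the same groups with a union-find (disjoint-set) structure. It is only an illustration of what "linked together" means for this data, not a proposed BigQuery solution; `LINKS`, `connected_groups` and `GROUPS` are names made up for the example.

```python
from collections import defaultdict

# Same link rows as the example table t.
LINKS = [(1, "a"), (2, "a"), (2, "b"), (3, "b"), (4, "c"),
         (5, "c"), (6, "d"), (6, "e"), (7, "f")]


def connected_groups(links):
    """Group id_1 and id_2 values into connected components (union-find)."""
    parent = {}

    def find(node):
        # Walk up to the root, shortening the path along the way.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for id_1, id_2 in links:
        # Tag each node with its column so values from the two columns never collide.
        root_a, root_b = find(("id_1", id_1)), find(("id_2", id_2))
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect every id_1 and id_2 under the root of its component.
    groups = defaultdict(lambda: (set(), set()))
    for id_1, id_2 in links:
        id_1_coll, id_2_coll = groups[find(("id_1", id_1))]
        id_1_coll.add(id_1)
        id_2_coll.add(id_2)

    return sorted((sorted(a), sorted(b)) for a, b in groups.values())


GROUPS = connected_groups(LINKS)
for id_1_coll, id_2_coll in GROUPS:
    print(id_1_coll, id_2_coll)
```

Each printed line corresponds to one row of the desired output: one connected component, with its id_1 values and its id_2 values.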

One potential approach is to remodel the relationships between id_1 and id_2 so that we get all the links from id_1 to itself, then use a recursive common table expression to traverse all the possible paths between id_1 values, and finally aggregate (somewhat arbitrarily) to the lowest such value that can be reached from each id_1.

Explanation

Our steps are:

  1. Remodel the relationship into a series of self-joins for id_1
  2. Map each id_1 to the lowest id_1 that it is linked to via a recursive CTE
  3. Aggregate the recursive CTE using the lowest id_1 values as the GROUP BY column and grabbing all the linked id_1 and id_2 values via the ARRAY_AGG() function
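The three steps above can be mirrored in a short Python sketch, which may help when checking the SQL against a small sample. This is an illustration under assumptions, not the query itself: a breadth-first walk with a `seen` set stands in for the recursive CTE, and the function names `lowest_linked` and `aggregate` are invented for the example.

```python
from collections import defaultdict, deque

# Same link rows as the example table t.
LINKS = [(1, "a"), (2, "a"), (2, "b"), (3, "b"), (4, "c"),
         (5, "c"), (6, "d"), (6, "e"), (7, "f")]


def lowest_linked(links):
    """Steps 1 and 2: map each id_1 to the lowest id_1 reachable from it."""
    # Step 1: "self-join" on id_2 to get id_1-to-id_1 links.
    by_id_2 = defaultdict(set)
    for id_1, id_2 in links:
        by_id_2[id_2].add(id_1)
    neighbours = defaultdict(set)
    for id_1, id_2 in links:
        neighbours[id_1] |= by_id_2[id_2] - {id_1}

    # Step 2: walk outwards from every id_1 and keep the lowest id_1 seen.
    lowest = {}
    for start in list(neighbours):
        seen, queue = {start}, deque([start])
        while queue:
            for nxt in neighbours[queue.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        lowest[start] = min(seen)
    return lowest


def aggregate(links, lowest):
    """Step 3: group id_1 and id_2 values by the lowest linked id_1."""
    groups = defaultdict(lambda: (set(), set()))
    for id_1, id_2 in links:
        id_1_coll, id_2_coll = groups[lowest[id_1]]
        id_1_coll.add(id_1)
        id_2_coll.add(id_2)
    return {grp: (sorted(groups[grp][0]), sorted(groups[grp][1]))
            for grp in sorted(groups)}


LOWEST = lowest_linked(LINKS)
RESULT = aggregate(LINKS, LOWEST)
for grp, (id_1_coll, id_2_coll) in RESULT.items():
    print(grp, id_1_coll, id_2_coll)
```

Like the recursive query, this sketch starts a separate walk from every id_1, so it repeats work within each group. It is meant for checking results on small inputs, not as a fast algorithm.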

We can use something like this to remodel the relationships into a self-join (1.):


SELECT 
  a.id_1, a.id_2, b.id_1 AS linked_id
FROM t as a
  INNER JOIN t as b 
    ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1

Next, to set up the recursive table expression (2.), we can tweak the query above to also give us the lowest (LEAST) of the values for id_1 at each link, then use this as the base iteration:

WITH RECURSIVE base_iter AS (
  SELECT 
    a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
  FROM t as a
    INNER JOIN t as b 
    ON a.id_2 = b.id_2
  WHERE a.id_1 != b.id_1
)

We can also grab the lowest id_1 value at this time:

|id_1|linked_id|lowest_linked_id|
|----|---------|----------------|
|1|2|1|
|2|1|1|
|2|3|2|
|3|2|2|
|4|5|4|
|5|4|4|

For our recursive loop, we want to maintain an ARRAY of linked IDs and join each new iteration such that the id_1 value of the (n+1)th iteration is equal to the linked_id value of the nth iteration AND the nth linked_id value is not in the array of previously linked IDs.

We can code this as follows:


recursive_loop AS (
  SELECT id_1, linked_id, lowest_linked_id, [linked_id ] AS linked_ids
  FROM base_iter
  UNION ALL
    SELECT 
      prev_iter.id_1,  prev_iter.linked_id,
      iter.lowest_linked_id,
      ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
    FROM base_iter AS prev_iter
    JOIN recursive_loop AS iter
      ON iter.id_1 = prev_iter.linked_id
      AND iter.lowest_linked_id <  prev_iter.lowest_linked_id
      AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids )      
)

Giving us the following results:

|id_1|linked_id|lowest_linked_id|linked_ids|
|----|---------|----------------|----------|
|3|2|1|[1,2]|
|2|3|1|[1,2,3]|
|4|5|4|[5]|
|1|2|1|[2]|
|5|4|4|[4]|
|2|3|2|[3]|
|2|1|1|[1]|
|3|2|2|[2]|

which we can now link back to the original table for the id_2 values, then aggregate (3.), as shown in the complete query below.

Solution

WITH RECURSIVE t AS (
  SELECT 1 AS id_1, 'a' AS id_2,
  UNION ALL SELECT 2, 'a'
  UNION ALL SELECT 2, 'b'
  UNION ALL SELECT 3, 'b'
  UNION ALL SELECT 4, 'c'
  UNION ALL SELECT 5, 'c'
  UNION ALL SELECT 6, 'd'
  UNION ALL SELECT 6, 'e'
  UNION ALL SELECT 7, 'f'
),

base_iter AS (
  SELECT 
    a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
  FROM t as a
    INNER JOIN t as b 
    ON a.id_2 = b.id_2
  WHERE a.id_1 != b.id_1
),

recursive_loop AS (
  SELECT id_1, linked_id, lowest_linked_id, [linked_id ] AS linked_ids
  FROM base_iter
  UNION ALL
    SELECT 
      prev_iter.id_1,  prev_iter.linked_id,
      iter.lowest_linked_id,
      ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
    FROM base_iter AS prev_iter
    JOIN recursive_loop AS iter
      ON iter.id_1 = prev_iter.linked_id
      AND iter.lowest_linked_id <  prev_iter.lowest_linked_id
      AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids )

      
),

link_back AS (
  SELECT 
    t.id_1, IFNULL(lowest_linked_id, t.id_1) AS lowest_linked_id, t.id_2
  FROM t
    LEFT JOIN recursive_loop
    ON t.id_1 = recursive_loop.id_1
),

by_id_1 AS (
  SELECT 
    id_1,
    MIN(lowest_linked_id) AS grp

  FROM link_back
    GROUP BY 1
),

by_id_2 AS (
  SELECT 
    id_2,
    MIN(lowest_linked_id) AS grp

  FROM link_back
    GROUP BY 1
),

result AS (
  SELECT 
    by_id_1.grp,
    ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) AS id1_coll,
    ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) AS id2_coll,
  FROM 
    by_id_1
    INNER JOIN by_id_2
    ON by_id_1.grp = by_id_2.grp
  GROUP BY grp
)

-- Read from the result CTE defined above (no CTE named "final" exists).
SELECT grp, TO_JSON(id1_coll) AS id1_coll, TO_JSON(id2_coll) AS id2_coll
FROM result ORDER BY grp

Giving us the required output:

|grp|id1_coll|id2_coll|
|---|--------|--------|
|1|[1,2,3]|[a,b]|
|4|[4,5]|[c]|
|6|[6]|[d,e]|
|7|[7]|[f]|

Limitations/Issues

Unfortunately this approach is inefficient (we have to traverse every single pathway before aggregating it back together) and fails in the real-world case where we have several million join rows. When trying to execute on this data, BigQuery runs up a huge "Slot time consumed" and then eventually errors out with:

Resources exceeded during query execution: Your project or organization exceeded the maximum disk and memory limit available for shuffle operations. Consider provisioning more slots, reducing query concurrency, or using more efficient logic in this job.

I hope there might be a better way of doing the recursive join such that pathways can be merged/aggregated as we go (if we have an id_1 value AND a linked_id already in the list of linked_ids, we don't need to check it further).
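One way to read "merge as we go" is min-label propagation: keep a single label per id_1 (the lowest id_1 seen so far) and repeatedly push the lowest label across each shared id_2 until nothing changes. This technique is not part of the original post; the Python sketch below only illustrates the idea, and `propagate_min_labels` is a made-up name. In BigQuery each round would presumably be a pair of GROUP BY steps inside a scripting loop, which is an untested assumption.

```python
# Same link rows as the example table t.
LINKS = [(1, "a"), (2, "a"), (2, "b"), (3, "b"), (4, "c"),
         (5, "c"), (6, "d"), (6, "e"), (7, "f")]


def propagate_min_labels(links):
    """Return ({id_1: lowest id_1 in its group}, number of rounds run)."""
    # Every id_1 starts out labelled with itself.
    label = {id_1: id_1 for id_1, _ in links}
    rounds = 0
    while True:
        rounds += 1
        # Lowest label currently attached to each id_2 (like GROUP BY id_2).
        id_2_label = {}
        for id_1, id_2 in links:
            current = id_2_label.get(id_2, label[id_1])
            id_2_label[id_2] = min(current, label[id_1])
        # Push that label back onto each id_1 (like GROUP BY id_1).
        new_label = dict(label)
        for id_1, id_2 in links:
            new_label[id_1] = min(new_label[id_1], id_2_label[id_2])
        # Stop once a full round changes nothing.
        if new_label == label:
            return label, rounds
        label = new_label


LABELS, ROUNDS = propagate_min_labels(LINKS)
print(LABELS, ROUNDS)
```

The state here is one label per id_1 rather than one array per path, so it does not grow with the number of pathways. The trade-off is that the number of rounds grows with the longest chain of links in any group.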

Using ROW_NUMBER(), the query is as follows:

WITH RECURSIVE
t AS (
  SELECT 1 AS id_1, 'a' AS id_2,
  UNION ALL SELECT 2, 'a'
  UNION ALL SELECT 2, 'b'
  UNION ALL SELECT 3, 'b'
  UNION ALL SELECT 4, 'c'
  UNION ALL SELECT 5, 'c'
  UNION ALL SELECT 6, 'd'
  UNION ALL SELECT 6, 'e'
  UNION ALL SELECT 7, 'f'
),
t1 AS (
  SELECT ROW_NUMBER() OVER(ORDER BY t.id_1) n, t.id_1, t.id_2 FROM t
),
t2 AS (
  SELECT n, [n] n_arr, id_1, id_2 FROM t1
    WHERE n IN (SELECT MIN(n) FROM t1 GROUP BY id_1) -- for reducing rows
  UNION ALL
  SELECT t2.n, ARRAY_CONCAT(t2.n_arr, [t1.n]), t1.id_1, t1.id_2
    FROM t2 JOIN t1 ON
      t2.n < t1.n AND  -- for reducing rows
      t1.n NOT IN UNNEST(t2.n_arr) AND
      (t2.id_1 = t1.id_1 OR t2.id_2 = t1.id_2)
),
t3 AS (
  SELECT
    n,
    ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) arr_1,
    ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) arr_2
  FROM t2
  WHERE n IN (SELECT MIN(n) FROM t2 GROUP BY id_1)
  GROUP BY n
)
SELECT n, TO_JSON(arr_1), TO_JSON(arr_2) FROM t3 ORDER BY n

  • t1: Append row numbers.
  • t2: Extract rows matching either id_1 or id_2 by recursive query.
  • t3: Make arrays from id_1 and id_2 with ARRAY_AGG().

However, it may not help with your Limitations/Issues.

The way this question is phrased makes it appear that you want "show me distinct groups from a presorted list, unchained to a previous group". For that, something like this should suffice (assuming auto-incrementing order, with one or both IDs moving to the next value):

SELECT GrpNr,
  STRING_AGG(DISTINCT CAST(id_1 as STRING), ',') as id_1_coll,
  STRING_AGG(DISTINCT CAST(id_2 as STRING), ',') as id_2_coll
FROM
(
SELECT id_1, id_2,
  SUM(CASE WHEN a.id_1 <> a.previous_id_1 and a.id_2 <> a.previous_id_2 THEN 1 ELSE 0 END) 
    OVER (ORDER BY RowNr) as GrpNr
FROM
(
SELECT *,
  ROW_NUMBER() OVER () as RowNr,
  LAG(t.id_1, 1) OVER (ORDER BY 1) AS previous_id_1,
  LAG(t.id_2, 1) OVER (ORDER BY 1) AS previous_id_2
FROM t
) a
ORDER BY RowNr
) a
GROUP BY GrpNr
ORDER BY GrpNr
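The grouping rule in the query above (start a new group only when both id_1 and id_2 differ from the previous row) can be sketched in Python to show how much it depends on row order. This is an illustrative sketch, not a translation of the exact window semantics, and `sequential_groups` is an invented name.

```python
# Same link rows as the example table t, in their presorted order.
LINKS = [(1, "a"), (2, "a"), (2, "b"), (3, "b"), (4, "c"),
         (5, "c"), (6, "d"), (6, "e"), (7, "f")]


def sequential_groups(rows):
    """Start a new group whenever both IDs differ from the previous row."""
    groups = []
    prev = None
    for id_1, id_2 in rows:
        starts_new = prev is not None and id_1 != prev[0] and id_2 != prev[1]
        if starts_new or not groups:
            groups.append((set(), set()))
        groups[-1][0].add(id_1)
        groups[-1][1].add(id_2)
        prev = (id_1, id_2)
    return [(sorted(a), sorted(b)) for a, b in groups]


# On the presorted example the rule recovers the expected groups.
SEQUENTIAL = sequential_groups(LINKS)

# With linked rows separated by an unrelated row, 1 and 2 are split
# even though both link to 'a'.
SHUFFLED = sequential_groups([(1, "a"), (4, "c"), (2, "a")])

print(SEQUENTIAL)
print(SHUFFLED)
```

The second call shows why this only suffices for presorted, chained input: once linked rows are no longer adjacent, the rule cannot see the connection.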

I don't think this is the question you mean to ask. This seems to be a graph-walking problem as referenced in the other answers, and in the response from @GordonLinoff to the question here, which I tested (and presume works for BigQuery).

This can also be done using sequential updates, as done by @RomanPekar here (which I also tested). The main consideration seems to be performance. I'd assume DBMSs have gotten better at recursion since this was posted.

Rolling it up in either case should be fairly easy using STRING_AGG() as given above or as you have.

I'd be curious to see a more accurate representation of the data. If there is some consistency in how the data is stored, or limits on the levels of nesting or other group structures, there may be a shortcut approach other than recursion or iterative updates.
