
Recursive query with optional depth limit with MySQL 5.6

I have two table schemas (MySQL 5.6 so no CTE), roughly looking like this:

CREATE TABLE nodes (
  node_id INT PRIMARY KEY,
  name VARCHAR(10)
);

CREATE TABLE edges (
  edge_id INT PRIMARY KEY,
  source INT,
  target INT,
  FOREIGN KEY (source) REFERENCES nodes(node_id),
  FOREIGN KEY (target) REFERENCES nodes(node_id)
);

In our design, a logical edge between two nodes (logically n1 -> n2) is actually represented as (n1 -> proxy node -> n2) in the db. The reason we use two edges and a proxy node for a logical edge is so that we can store properties on the edge. Therefore, when a client queries for two nodes connected by an edge, the query is translated to query three connected nodes instead.

I have written a query to get a path with a fixed length. For example, "give me all the paths that start with a node with some properties, and end with a node with some properties, with exactly 5 edges on the path." This is done without using recursion on the SQL side; I just generate a long query programmatically with the specified fixed length.
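As an illustration of that programmatic generation, here is a minimal Python sketch. The function name `build_fixed_length_query`, the aliases (`n0`, `p1`, `e1`, `n_end`), and the `%s` placeholder style are assumptions made for this example, not the asker's actual code; it only builds the SQL text and does not execute it.

```python
from __future__ import annotations


def build_fixed_length_query(start_name: str, end_name: str,
                             logical_edges: int) -> tuple[str, list[str]]:
    """Build a query for paths with exactly `logical_edges` logical edges.

    Every logical edge is stored as two rows in `edges`
    (node -> proxy -> node), so each one adds two joins.
    """
    if logical_edges < 1:
        raise ValueError("logical_edges must be at least 1")

    joins = []
    path_cols = ["n0.node_id"]
    prev = "n0.node_id"
    for i in range(1, logical_edges + 1):
        # p{i} is the hop into the proxy node, e{i} the hop out of it.
        joins.append(f"JOIN edges p{i} ON p{i}.source = {prev}")
        joins.append(f"JOIN edges e{i} ON e{i}.source = p{i}.target")
        prev = f"e{i}.target"
        path_cols.append(prev)

    sql = (
        f"SELECT CONCAT_WS(',', {', '.join(path_cols)}) AS path\n"
        "FROM nodes n0\n"
        + "\n".join(joins) + "\n"
        + f"JOIN nodes n_end ON n_end.node_id = {prev}\n"
        "WHERE n0.name = %s AND n_end.name = %s"
    )
    # The names are returned separately so they can be bound as parameters.
    return sql, [start_name, end_name]


sql, params = build_fixed_length_query("foo", "bar", 2)
print(sql)
```

The names are passed as bound parameters rather than interpolated into the string, which avoids quoting problems when the properties come from a client.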

The challenge is, we want to support querying of a variable-length path. For example, "give me all the paths that start with a node with some properties, and end with a node with some properties, with no fewer than 3 edges and no more than 10 edges on the path." Is this feasible without (or even with) CTE?

EDIT:

Some sample data:

-- Logical nodes are 1, 3, 5, 7, 9, 11. The rest are proxy nodes.
INSERT INTO nodes VALUES
  (1, 'foo'),
  (2, '_proxy_'),
  (3, 'foo'),
  (4, '_proxy_'),
  (5, 'bar'),
  (6, '_proxy_'),
  (7, 'bar'),
  (8, '_proxy_'),
  (9, 'bar'),
  (10, '_proxy_'),
  (11, 'bar');

-- Connects 1 -> 2 -> ... -> 11.
INSERT INTO edges VALUES
  (1, 1, 2),
  (2, 2, 3),
  (3, 3, 4),
  (4, 4, 5),
  (5, 5, 6),
  (6, 6, 7),
  (7, 7, 8),
  (8, 8, 9),
  (9, 9, 10),
  (10, 10, 11);

The query can be, "select the ID and names of all the nodes on a path such that the path starts with a node named 'foo' and ends with a node named 'bar', with at least 2 nodes and at most 4 nodes on the path." Such paths include 1 -> 3 -> 5, 1 -> 3 -> 5 -> 7, 3 -> 5, 3 -> 5 -> 7, and 3 -> 5 -> 7 -> 9. So the result set should include the IDs and names of nodes 1, 3, 5, 7, 9.

The following query uses a recursive CTE, so it requires MySQL 8.0 or later. It returns all paths of interest as comma-separated strings.

with recursive rcte as (
  select e.source, e.target, 1 as depth, concat(e.source) as path
  from nodes n
  join edges e on e.source = n.node_id
  where n.name = 'foo' -- start node name
  union all
  select e.source, e.target, r.depth + 1 as depth, concat_ws(',', r.path, e.source)
  from rcte r
  join edges p on p.source = r.target -- p for proxy
  join edges e on e.source = p.target
  where r.depth < 4 -- max path nodes
) 
select r.path
from rcte r
join nodes n on n.node_id = r.source
where r.depth >= 2 -- min path nodes
  and n.name = 'bar' -- end node name

The result looks like this:

| path    |
| ------- |
| 3,5     |
| 1,3,5   |
| 3,5,7   |
| 1,3,5,7 |
| 3,5,7,9 |


You can now parse the strings in application code and merge/union the arrays. If you only want the contained node ids, you can also change the outer query to:

select distinct r2.source
from rcte r
join nodes n on n.node_id = r.source
join rcte r2 on find_in_set(r2.source, r.path)
where r.depth >= 2 -- min path nodes
  and n.name = 'bar' -- end node name

Result:

| source |
| ------ |
| 1      |
| 3      |
| 5      |
| 7      |
| 9      |


Note that a JOIN on FIND_IN_SET() might be slow if rcte contains many rows. I would rather do this step in application code, which should be quite simple in a procedural language.
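For example, the merge step in application code could look like the following Python sketch. The function name `nodes_on_paths` is made up for this illustration, and the input is assumed to be the `path` column values already fetched from the database.

```python
def nodes_on_paths(paths):
    """Return the sorted, de-duplicated node ids found in comma-separated paths."""
    node_ids = set()
    for path in paths:
        node_ids.update(int(part) for part in path.split(","))
    return sorted(node_ids)


# The path strings from the result table above.
rows = ["3,5", "1,3,5", "3,5,7", "1,3,5,7", "3,5,7,9"]
print(nodes_on_paths(rows))  # [1, 3, 5, 7, 9]
```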

MySQL 5.6 solution

Before MySQL 8.0 and MariaDB 10.2 there was no support for recursive queries. Furthermore, several other limitations make a workaround difficult. For example:

  • No dynamic queries in stored functions
  • No way to use a temporary table twice in a single statement
  • No TEXT type in the MEMORY engine

However, an RCTE can be emulated in a stored procedure by moving rows between two (temporary) tables. The following procedure does that:

delimiter //
create procedure get_path(
  in source_name text,
  in target_name text,
  in min_depth int,
  in max_depth int
)
begin
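  -- Drop leftovers from an earlier call in the same session,
  -- so that the procedure can be called more than once.
  drop temporary table if exists tmp_sources, tmp_targets;
  create temporary table tmp_sources (id int, depth int, path text) engine=innodb;
  create temporary table tmp_targets like tmp_sources;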

  insert into tmp_sources (id, depth, path)
    select n.node_id, 1, n.node_id
    from nodes n
    where n.name = source_name;

  set @depth = 1;
  while @depth < max_depth do
    set @depth = @depth+1;
    insert into tmp_targets(id, depth, path)
      select e.target, @depth, concat_ws(',', t.path, e.target)
      from tmp_sources t
      join edges p on p.source = t.id
      join edges e on e.source = p.target
      where t.depth = @depth - 1;

    insert into tmp_sources (id, depth, path)
      select id, depth, path
      from tmp_targets;

    truncate tmp_targets;
  end while;

  select t.path
    from tmp_sources t
    join nodes n on n.node_id = t.id
    where n.name = target_name
      and t.depth >= min_depth;
end //
delimiter ;

Use it as:

call get_path('foo', 'bar', 2, 4)

Result:

| path    |
| ------- |
| 3,5     |
| 1,3,5   |
| 3,5,7   |
| 1,3,5,7 |
| 3,5,7,9 |


This is far from optimal. If the result has many or long paths, you might need to define some indexes on the temporary tables. Also, I don't like the idea of creating (temporary) tables in stored procedures. See it as a "proof of concept" and use it at your own risk.
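To make the level-by-level expansion in the procedure easier to follow, here is a rough Python sketch of the same loop over in-memory data. It is an illustration only, not a replacement for the SQL: the function name `get_paths` and the dictionary and list shapes are assumptions, `found` plays the role of `tmp_sources`, and `next_frontier` plays the role of `tmp_targets`.

```python
def get_paths(nodes, edges, source_name, target_name, min_depth, max_depth):
    """nodes: {node_id: name}; edges: list of (source, target) pairs."""
    out = {}
    for source, target in edges:
        out.setdefault(source, []).append(target)

    # Depth 1: every node with the start name is a path of one node.
    frontier = [(nid, [nid]) for nid, name in nodes.items() if name == source_name]
    found = list(frontier)
    depth = 1
    while depth < max_depth and frontier:
        depth += 1
        next_frontier = []
        for node_id, path in frontier:
            # Two hops per step: node -> proxy -> next logical node.
            for proxy in out.get(node_id, []):
                for target in out.get(proxy, []):
                    next_frontier.append((target, path + [target]))
        found.extend(next_frontier)
        frontier = next_frontier

    return [
        path
        for node_id, path in found
        if nodes[node_id] == target_name and len(path) >= min_depth
    ]


# Sample data from the question.
nodes = {1: "foo", 2: "_proxy_", 3: "foo", 4: "_proxy_", 5: "bar", 6: "_proxy_",
         7: "bar", 8: "_proxy_", 9: "bar", 10: "_proxy_", 11: "bar"}
edges = [(i, i + 1) for i in range(1, 11)]

for path in get_paths(nodes, edges, "foo", "bar", 2, 4):
    print(",".join(map(str, path)))
```

Because the loop stops at `max_depth`, it also terminates on graphs that contain cycles.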

I've solved this sort of problem with a transitive closure table. This enumerates every direct and indirect path through your nodes. The edges you currently have are paths of length 1. But you also need paths of length 0 (i.e., a node has a path to itself), and a row for every path from a source node to an eventual target node with length greater than 1.

create table closure (
  source int,
  target int,
  length int,
  is_direct bool,
  primary key (source, target)
);

insert into closure values
  (1, 1, 0, false), (1, 2, 1, true), (1, 3, 2, false), (1, 4, 3, false), (1, 5, 4, false), (1, 6, 5, false), (1, 7, 6, false), (1, 8, 7, false), (1, 9, 8, false), (1, 10, 9, false), (1, 11, 10, false),
  (2, 2, 0, false), (2, 3, 1, true), (2, 4, 2, false), (2, 5, 3, false), (2, 6, 4, false), (2, 7, 5, false), (2, 8, 6, false), (2, 9, 7, false), (2, 10, 8, false), (2, 11, 9, false),
  (3, 3, 0, false), (3, 4, 1, true), (3, 5, 2, false), (3, 6, 3, false), (3, 7, 4, false), (3, 8, 5, false), (3, 9, 6, false), (3, 10, 7, false), (3, 11, 8, false),
  (4, 4, 0, false), (4, 5, 1, true), (4, 6, 2, false), (4, 7, 3, false), (4, 8, 4, false), (4, 9, 5, false), (4, 10, 6, false), (4, 11, 7, false),
  (5, 5, 0, false), (5, 6, 1, true), (5, 7, 2, false), (5, 8, 3, false), (5, 9, 4, false), (5, 10, 5, false), (5, 11, 6, false),
  (6, 6, 0, false), (6, 7, 1, true), (6, 8, 2, false), (6, 9, 3, false), (6, 10, 4, false), (6, 11, 5, false),
  (7, 7, 0, false), (7, 8, 1, true), (7, 9, 2, false), (7, 10, 3, false), (7, 11, 4, false),
  (8, 8, 0, false), (8, 9, 1, true), (8, 10, 2, false), (8, 11, 3, false),
  (9, 9, 0, false), (9, 10, 1, true), (9, 11, 2, false),
  (10, 10, 0, false), (10, 11, 1, true),
  (11, 11, 0, false);
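Rows like these do not have to be written by hand. The following Python sketch derives them from the edge list with a breadth-first search from every node. The function name `build_closure` is hypothetical, and storing the shortest length per (source, target) pair is an assumption that fits the primary key above; a graph with several paths of different lengths between the same pair would need a different key.

```python
from collections import deque


def build_closure(node_ids, edges):
    """Return (source, target, length, is_direct) rows for every reachable pair."""
    out = {}
    for source, target in edges:
        out.setdefault(source, []).append(target)

    rows = []
    for start in node_ids:
        dist = {start: 0}  # length 0: every node reaches itself
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in out.get(current, []):
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
        for target, length in dist.items():
            rows.append((start, target, length, length == 1))
    return rows


# The chain 1 -> 2 -> ... -> 11 from the sample data.
closure_rows = build_closure(range(1, 12), [(i, i + 1) for i in range(1, 11)])
print(len(closure_rows))  # 66
```

The resulting tuples can be bulk-inserted into the `closure` table; the table then has to be maintained whenever edges are added or removed.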

Now we can write your query:

"select the ID and names of all the nodes on a path such that the path starts with a node named 'foo' and ends with a node named 'bar', with at least 2 nodes and at most 4 nodes on the path."

I translate this into paths of length 4, 6, or 8 (that is, 2 to 4 logical edges), because there is a proxy node between each pair of logical nodes, so it really takes two hops to go from one node to the next.
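The arithmetic behind those lengths can be sketched as follows. The helper names are invented for this example. The second helper covers the case where the bounds are read as counts of nodes on the path, which is how the question phrases them; a path with n logical nodes has n - 1 logical edges.

```python
def physical_lengths(min_logical_edges, max_logical_edges):
    """Each logical edge is two rows in `edges` (node -> proxy -> node)."""
    return [2 * k for k in range(min_logical_edges, max_logical_edges + 1)]


def physical_lengths_for_node_counts(min_nodes, max_nodes):
    """A path with n logical nodes has n - 1 logical edges."""
    return physical_lengths(min_nodes - 1, max_nodes - 1)


print(physical_lengths(2, 4))                  # [4, 6, 8]
print(physical_lengths_for_node_counts(2, 4))  # [2, 4, 6]
```

Whichever list applies can be substituted into the `c.length in (...)` condition.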

select source.node_id as source_node, target.node_id as target_node, c.length
from nodes as source
join closure as c on source.node_id = c.source
join nodes as target on c.target = target.node_id
where source.name='foo' and target.name = 'bar' and c.length in (4,6,8)

Here's the result. Because a path of length 8 spans five logical nodes, it also includes node 11:

+-------------+-------------+--------+
| source_node | target_node | length |
+-------------+-------------+--------+
|           1 |           5 |      4 |
|           1 |           7 |      6 |
|           1 |           9 |      8 |
|           3 |           7 |      4 |
|           3 |           9 |      6 |
|           3 |          11 |      8 |
+-------------+-------------+--------+

Re comment from Paul Spiegel:

Once you have the endpoints of the path, you can query the closure for all paths that start at the source, and end at a node that also has a path to the target.

select source.node_id as source_node, target.node_id as target_node,
  group_concat(i1.target order by i1.target) as interim_nodes
from nodes as source
join closure as c on source.node_id = c.source
join nodes as target on c.target = target.node_id
join closure as i1 on source.node_id = i1.source
join closure as i2 on target.node_id = i2.target and i1.target = i2.source
where source.name='foo' and target.name = 'bar' and c.length in (4,6,8)
group by source.node_id, target.node_id

+-------------+-------------+---------------------+
| source_node | target_node | interim_nodes       |
+-------------+-------------+---------------------+
|           1 |           5 | 1,2,3,4,5           |
|           1 |           7 | 1,2,3,4,5,6,7       |
|           1 |           9 | 1,2,3,4,5,6,7,8,9   |
|           3 |           7 | 3,4,5,6,7           |
|           3 |           9 | 3,4,5,6,7,8,9       |
|           3 |          11 | 3,4,5,6,7,8,9,10,11 |
+-------------+-------------+---------------------+
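Note that `interim_nodes` is ordered by node id, which matches the path order only because the sample ids happen to increase along the chain, and it still contains the proxy nodes. A possible post-processing step in application code is sketched below in Python. The function name, the dictionary shapes, and the `'_proxy_'` naming convention are assumptions taken from the sample data; like the SQL above, it merges all routes between the two endpoints if more than one exists.

```python
def logical_nodes_on_path(closure, names, source, target):
    """closure: {(source, target): length}; names: {node_id: name}.

    Returns the non-proxy nodes between source and target,
    ordered by their distance from the source.
    """
    interim = [
        (length, mid)
        for (src, mid), length in closure.items()
        if src == source and (mid, target) in closure
    ]
    return [mid for _length, mid in sorted(interim) if names[mid] != "_proxy_"]


# Closure and names for the sample chain 1 -> 2 -> ... -> 11.
closure = {(s, t): t - s for s in range(1, 12) for t in range(s, 12)}
names = {i: "_proxy_" for i in range(2, 11, 2)}
names.update({1: "foo", 3: "foo", 5: "bar", 7: "bar", 9: "bar", 11: "bar"})

print(logical_nodes_on_path(closure, names, 1, 7))  # [1, 3, 5, 7]
```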
