简体   繁体   中英

How to assign connected subgraph ID to each node in an undirected graph in Oracle SQL?

How to partition a set of links in a link_tbl defined by the from_node and to_node . There are 10M nodes and 25M links (no more than 20 links per node).

For example
在此处输入图片说明

The graph consists of three disjoint subgraphs.

  create table link_tbl as (
  select 'AY' as linkid, 'A' as from_node, 'Y' as to_node from dual union all
  select 'AB', 'A', 'B' from dual union all
  select 'CA', 'C', 'A' from dual union all      
  select 'GE', 'G', 'E' from dual union all
  select 'CB', 'C', 'B' from dual union all
  select 'EF', 'E', 'F' from dual union all
  select 'NM', 'N', 'M' from dual
  );
  --compute subnetid
  select * from link_tbl order by subnetid;

To get a result set with subnetid values?

在此处输入图片说明

I can use a variant of flood fill in Java to assign subgraph id to each node of a graph. But can this be done in SQL?

Pseudocode:
- Rank nodes with consecutive integers. - Represent links as (node1*100M + node2) as number(19,0) - Union links node1,node2 with node2, node1 and sort - Get first node as an anchor node and add to subgraph_nodes' - Iterate over link table - add node2 to subgraph_nodes if node1 is in (subgraph_nodes) AND node2 not in subgraph_nodes

This should add all connected nodes to subgraph_nodes

This will be sufficient because now I can add subgraph_id' to node table and select all nodes without subgraph id and repeat subgraph analysis.

There are 10M nodes represented as consecutive integers and 25M links (no more than 20 links per node), represented as (from_node*100M + to_node) as ID, from_node, to_node .

A while ago I wrote an answer to the question How to find all connected subgraphs of an undirected graph . It was written for SQL Server, but Oracle supports standard recursive queries, so it was easy to convert it to Oracle. It may be written more efficiently using constructs specific to Oracle.

Sample data

create table link_tbl as (
  select 'AY' as linkid, 'A' as from_node, 'Y' as to_node from dual union all
  select 'AB', 'A', 'B' from dual union all
  select 'CA', 'C', 'A' from dual union all      
  select 'GE', 'G', 'E' from dual union all
  select 'CB', 'C', 'B' from dual union all
  select 'EF', 'E', 'F' from dual union all
  select 'NM', 'N', 'M' from dual
);

Query

WITH
CTE_Nodes
AS
(
    SELECT from_node AS Node
    FROM link_tbl

    UNION

    SELECT to_node AS Node
    FROM link_tbl
)
,CTE_Pairs
AS
(
    SELECT from_node AS Node1, to_node AS Node2
    FROM link_tbl
    WHERE from_node <> to_node

    UNION

    SELECT to_node AS Node1, from_node AS Node2
    FROM link_tbl
    WHERE from_node <> to_node
)
,CTE_Recursive (AnchorNode, Node1, Node2, NodePath, Lvl)
AS
(
    SELECT
        CAST(CTE_Nodes.Node AS varchar(2000)) AS AnchorNode
        , Node1
        , Node2
        , CAST(',' || Node1 || ',' || Node2 || ',' AS varchar(2000)) AS NodePath
        , 1 AS Lvl
    FROM 
        CTE_Pairs
        INNER JOIN CTE_Nodes ON CTE_Nodes.Node = CTE_Pairs.Node1

    UNION ALL

    SELECT 
        CTE_Recursive.AnchorNode
        , CTE_Pairs.Node1
        , CTE_Pairs.Node2
        , CAST(CTE_Recursive.NodePath || CTE_Pairs.Node2 || ',' AS varchar(2000)) AS NodePath
        , CTE_Recursive.Lvl + 1 AS Lvl
    FROM
        CTE_Pairs
        INNER JOIN CTE_Recursive ON CTE_Recursive.Node2 = CTE_Pairs.Node1
    WHERE
        CTE_Recursive.NodePath NOT LIKE CAST('%,' || CTE_Pairs.Node2 || ',%' AS varchar(2000))
)
,CTE_RecursionResult
AS
(
    SELECT AnchorNode, Node1, Node2
    FROM CTE_Recursive
)
,CTE_CleanResult
AS
(
    SELECT AnchorNode, Node1 AS Node
    FROM CTE_RecursionResult

    UNION

    SELECT AnchorNode, Node2 AS Node
    FROM CTE_RecursionResult
)
SELECT
    CTE_Nodes.Node
    ,LISTAGG(CTE_CleanResult.Node, ',') WITHIN GROUP (ORDER BY CTE_CleanResult.Node) AS GroupMembers
    ,DENSE_RANK() OVER (ORDER BY LISTAGG(CTE_CleanResult.Node, ',') WITHIN GROUP (ORDER BY CTE_CleanResult.Node)) AS GroupID
FROM
    CTE_Nodes
    INNER JOIN CTE_CleanResult ON CTE_CleanResult.AnchorNode = CTE_Nodes.Node
GROUP BY
    CTE_Nodes.Node
ORDER BY
    GroupID
    ,CTE_Nodes.Node
;

Result

+------+--------------+---------+
| NODE | GROUPMEMBERS | GROUPID |
+------+--------------+---------+
| A    | A,B,C,Y      |       1 |
| B    | A,B,C,Y      |       1 |
| C    | A,B,C,Y      |       1 |
| Y    | A,B,C,Y      |       1 |
| E    | E,F,G        |       2 |
| F    | E,F,G        |       2 |
| G    | E,F,G        |       2 |
| M    | M,N          |       3 |
| N    | M,N          |       3 |
+------+--------------+---------+

https://dbfiddle.uk/?rdbms=oracle_11.2&fiddle=e61cf73824e7718a4686430ccd7398e7

How it works

CTE_Nodes

CTE_Nodes gives the list of all Nodes that appear in both from_node and to_node columns. Since they can appear in any order we UNION both columns together. UNION also removes any duplicates.

+------+
| NODE |
+------+
| A    |
| B    |
| C    |
| E    |
| F    |
| G    |
| M    |
| N    |
| Y    |
+------+

CTE_Pairs

CTE_Pairs gives the list of all edges of the graph in both directions. Again, UNION is used to remove any duplicates.

+-------+-------+
| NODE1 | NODE2 |
+-------+-------+
| A     | B     |
| A     | C     |
| A     | Y     |
| B     | A     |
| B     | C     |
| C     | A     |
| C     | B     |
| E     | F     |
| E     | G     |
| F     | E     |
| G     | E     |
| M     | N     |
| N     | M     |
| Y     | A     |
+-------+-------+

CTE_Recursive

CTE_Recursive is the main part of the query that recursively traverses the graph starting from each unique Node. These starting rows are produced by the first part of UNION ALL . The second part of UNION ALL recursively joins to itself linking Node2 to Node1 . Since we pre-made CTE_Pairs with all edges written in both directions, we can always link only Node2 to Node1 and we'll get all paths in the graph. At the same time the query builds NodePath - a string of comma-delimited Nodes that have been traversed so far. It is used in the WHERE filter:

CTE_Recursive.NodePath NOT LIKE CAST('%,' || CTE_Pairs.Node2 || ',%' AS varchar(2000))

As soon as we come across the Node that had been included in the Path before, the recursion stops as the list of connected nodes is exhausted. AnchorNode is the starting Node for the recursion, it will be used later to group results. Lvl is not really used, I included it for better understanding of what is going on.

+------------+-------+-------+-----------+-----+
| ANCHORNODE | NODE1 | NODE2 | NODEPATH  | LVL |
+------------+-------+-------+-----------+-----+
| A          | A     | Y     | ,A,Y,     |   1 |
| A          | A     | C     | ,A,C,     |   1 |
| A          | A     | B     | ,A,B,     |   1 |
| B          | B     | C     | ,B,C,     |   1 |
| B          | B     | A     | ,B,A,     |   1 |
| C          | C     | B     | ,C,B,     |   1 |
| C          | C     | A     | ,C,A,     |   1 |
| E          | E     | G     | ,E,G,     |   1 |
| E          | E     | F     | ,E,F,     |   1 |
| F          | F     | E     | ,F,E,     |   1 |
| G          | G     | E     | ,G,E,     |   1 |
| M          | M     | N     | ,M,N,     |   1 |
| N          | N     | M     | ,N,M,     |   1 |
| Y          | Y     | A     | ,Y,A,     |   1 |
| Y          | A     | B     | ,Y,A,B,   |   2 |
| C          | A     | B     | ,C,A,B,   |   2 |
| Y          | A     | C     | ,Y,A,C,   |   2 |
| B          | A     | C     | ,B,A,C,   |   2 |
| C          | A     | Y     | ,C,A,Y,   |   2 |
| B          | A     | Y     | ,B,A,Y,   |   2 |
| C          | B     | A     | ,C,B,A,   |   2 |
| A          | B     | C     | ,A,B,C,   |   2 |
| B          | C     | A     | ,B,C,A,   |   2 |
| A          | C     | B     | ,A,C,B,   |   2 |
| G          | E     | F     | ,G,E,F,   |   2 |
| F          | E     | G     | ,F,E,G,   |   2 |
| B          | A     | Y     | ,B,C,A,Y, |   3 |
| C          | A     | Y     | ,C,B,A,Y, |   3 |
| Y          | B     | C     | ,Y,A,B,C, |   3 |
| Y          | C     | B     | ,Y,A,C,B, |   3 |
+------------+-------+-------+-----------+-----+

CTE_CleanResult

CTE_CleanResult leaves only relevant parts from CTE_Recursive and again merges both Node1 and Node2 using UNION .

+------------+------+
| ANCHORNODE | NODE |
+------------+------+
| A          | A    |
| A          | B    |
| A          | C    |
| A          | Y    |
| B          | A    |
| B          | B    |
| B          | C    |
| B          | Y    |
| C          | A    |
| C          | B    |
| C          | C    |
| C          | Y    |
| E          | E    |
| E          | F    |
| E          | G    |
| F          | E    |
| F          | F    |
| F          | G    |
| G          | E    |
| G          | F    |
| G          | G    |
| M          | M    |
| M          | N    |
| N          | M    |
| N          | N    |
| Y          | A    |
| Y          | B    |
| Y          | C    |
| Y          | Y    |
+------------+------+

Final SELECT

Now we need to build a string of comma-separated Node values for each AnchorNode . LISTAGG does it. DENSE_RANK() calculates the GroupID numbers for each AnchorNode .


Efficiency

You have rather large tables, so trying to find all groups at once in a single query above may be quite inefficient.

One way to make it more efficient is not to use a single query for the whole dataset. Don't try to find all subnets/groups at the same time. Limit the starting point to one node. Add WHERE CTE_Nodes.Node = 'some node' to the first part of CTE_Recursive . The query will find all nodes of one subnet. Delete these found nodes from the big table, pick another starting node, repeat in a loop until the big table is empty. If you materialise CTE_Nodes and CTE_Pairs as temp tables with indexes, it may help as well.

I have never worked with Oracle and don't know its syntax for procedural code, so I'll write in a pseudocode below.

Prepare temp tables

CREATE TABLE Nodes AS
(
    SELECT from_node AS Node
    FROM link_tbl

    UNION

    SELECT to_node AS Node
    FROM link_tbl
);
CREATE INDEX IX_Node ON Nodes (Node);

CREATE TABLE Pairs AS
(
    SELECT from_node AS Node1, to_node AS Node2
    FROM link_tbl
    WHERE from_node <> to_node

    UNION

    SELECT to_node AS Node1, from_node AS Node2
    FROM link_tbl
    WHERE from_node <> to_node
);
CREATE INDEX IX_Node1 ON Pairs (Node1);
CREATE INDEX IX_Node2 ON Pairs (Node2);

CREATE TABLE Subgraph AS
(
    SELECT Node FROM Nodes WHERE 1=0
);

CREATE TABLE Result
(
    GroupID int NOT NULL,
    Node varchar(10) NOT NULL
);

SET :GroupID = 0;

The main loop starts

Pick one Node from Nodes . Any row will do. Save the Node into variable. Again, I don't know proper Oracle syntax for it.

SELECT :N = Node FROM Nodes WHERE rownum=1;

If Nodes is empty, stop the loop.

SET :GroupID = :GroupID + 1;

Run the recursive query starting the recursion from one specific Node chosen above.

INSERT INTO Subgraph (Node)
WITH
CTE_Recursive (AnchorNode, Node1, Node2, NodePath, Lvl)
AS
(
    SELECT
        CAST(Nodes.Node AS varchar(2000)) AS AnchorNode
        , Node1
        , Node2
        , CAST(',' || Node1 || ',' || Node2 || ',' AS varchar(2000)) AS NodePath
        , 1 AS Lvl
    FROM 
        Pairs
        INNER JOIN Nodes ON Nodes.Node = Pairs.Node1
    WHERE
        Nodes.Node = :N  -- 'A'
        -- use variable here, don't know what the syntax is

    UNION ALL

    SELECT 
        CTE_Recursive.AnchorNode
        , Pairs.Node1
        , Pairs.Node2
        , CAST(CTE_Recursive.NodePath || Pairs.Node2 || ',' AS varchar(2000)) AS NodePath
        , CTE_Recursive.Lvl + 1 AS Lvl
    FROM
        Pairs
        INNER JOIN CTE_Recursive ON CTE_Recursive.Node2 = Pairs.Node1
    WHERE
        CTE_Recursive.NodePath NOT LIKE CAST('%,' || Pairs.Node2 || ',%' AS varchar(2000))
)
,CTE_Result
AS
(
    SELECT Node1 AS Node
    FROM CTE_Recursive

    UNION

    SELECT Node2 AS Node
    FROM CTE_Recursive
)
SELECT Node FROM CTE_Result
;

This query will return all nodes that are connected to the given starting node, ie those that form a subgraph. Save its result set into a Subgraph temp table.

Append result into the final Result table and allocate an ID to the found subgraph.

INSERT INTO Result (GroupID, Node)
SELECT :GroupID, Node
FROM Subgraph;

Delete processed nodes from Nodes and Pairs .

DELETE FROM Nodes
WHERE Node IN (SELECT Node FROM Subgraph)
;

DELETE FROM Pairs
WHERE 
    Node1 IN (SELECT Node FROM Subgraph)
    OR
    Node2 IN (SELECT Node FROM Subgraph)
;

Clean the Subgraph table

DELETE FROM Subgraph; 

Return to the start of the loop.

In the end Result table will have all nodes with the corresponding ID of the subgraph.


Actually, you can simplify it further. You don't need the Nodes table, Pairs is enough.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM