简体   繁体   中英

oracle SQL — Replace comma separated string with another set of values

I have the following table:

B_ID    I_ID    R_ID
W00001  1234    1235,1237
B00001  1235    1236,1235
T00001  1236    1235,1235,1235
X00001  1237    1234,1236,1238
M00001  1238    1238

I need output like below using sql

B_ID    I_ID    R_ID
W00001  1234    B00001|X00001
B00001  1235    T00001|B00001
T00001  1236    B00001
X00001  1237    W00001|T00001|M00001
M00001  1238    M00001
  1. R_ID should match its value with I_ID and pick the corresponding B_ID.
  2. Commas present in R_ID column should replaced with '|' delimiter.

Example: 1st row R_ID has values 1235,1237. 1235 and 1237 is present in I_ID so their corresponding B_ID is picked ie B00001,X00001 and expected output is B00001|X00001

Without duplicates and does not rely on any magic numbers:

Option 1: Hierarchical Query

Oracle Setup :

CREATE TABLE test_data ( b_id, i_id, r_id ) as
select 'W00001', 1234, '1235,1237'      from dual union all
select 'B00001', 1235, '1236,1235'      from dual union all
select 'T00001', 1236, '1235,1235,1235' from dual union all
select 'X00001', 1237, '1234,1236,1238' from dual union all
select 'M00001', 1238, '1238'           from dual;

Query :

SELECT b_id,
       i_id,
       ( SELECT LISTAGG( t.b_id, '|' ) WITHIN GROUP ( ORDER BY ROWNUM )
         FROM   TABLE( CAST( MULTISET(
                  SELECT DISTINCT
                         TO_NUMBER( REGEXP_SUBSTR( d.r_id, '\d+', 1, LEVEL ) )
                  FROM   DUAL
                  CONNECT BY LEVEL <= REGEXP_COUNT( d.r_id, '\d+' )
                ) AS SYS.ODCINUMBERLIST ) ) v
                INNER JOIN test_data t
                ON (v.COLUMN_VALUE = t.i_id) ) AS r_id
FROM   test_data d;

Explanation

The inner correlated select:

SELECT DISTINCT
       TO_NUMBER( REGEXP_SUBSTR( d.r_id, '\d+', 1, LEVEL ) )
FROM   DUAL
CONNECT BY LEVEL <= REGEXP_COUNT( d.r_id, '\d+' )

Takes the r_id for a single row and and separates it into one row per comma-delimited value; the DISTINCT clause means only unique values are output.

This is converted into a table collection expression using TABLE( CAST( MULTISET( ... ) AS collection_type ) ) so that it can be joined to another table.

This is then self-joined back to test_data to convert from displaying i_id s to b_id s and LISTAGG() is used to re-aggregate the multiple rows back to a single row.

Output :

B_ID         I_ID R_ID
------ ---------- --------------------
W00001       1234 B00001|X00001
B00001       1235 T00001|B00001
T00001       1236 B00001
X00001       1237 W00001|T00001|M00001
M00001       1238 M00001

Option 2: Using PL/SQL

Oracle Setup :

CREATE OR REPLACE TYPE numberlist IS TABLE OF NUMBER;
/

CREATE OR REPLACE FUNCTION split_Number_List(
  i_str    IN  VARCHAR2,
  i_delim  IN  VARCHAR2 DEFAULT ','
) RETURN numberlist DETERMINISTIC
AS
  p_result       numberlist := numberlist();
  p_start        NUMBER(5) := 1;
  p_end          NUMBER(5);
  c_len CONSTANT NUMBER(5) := LENGTH( i_str );
  c_ld  CONSTANT NUMBER(5) := LENGTH( i_delim );
BEGIN
  IF c_len > 0 THEN
    p_end := INSTR( i_str, i_delim, p_start );
    WHILE p_end > 0 LOOP
      p_result.EXTEND;
      p_result( p_result.COUNT ) := TO_NUMBER( SUBSTR( i_str, p_start, p_end - p_start ) );
      p_start := p_end + c_ld;
      p_end := INSTR( i_str, i_delim, p_start );
    END LOOP;
    IF p_start <= c_len + 1 THEN
      p_result.EXTEND;
      p_result( p_result.COUNT ) := TO_NUMBER( SUBSTR( i_str, p_start, c_len - p_start + 1 ) );
    END IF;
  END IF;
  RETURN p_result;
END;
/

Query :

SELECT b_id,
       i_id,
       ( SELECT LISTAGG( t.b_id, '|' ) WITHIN GROUP ( ORDER BY ROWNUM )
         FROM   TABLE( SET( split_Number_List( d.r_id ) ) ) v
                INNER JOIN test_data t
                ON (v.COLUMN_VALUE = t.i_id) ) AS r_id
FROM   test_data d;

(Same output as above)

Option 3 :

SELECT b_id,
       i_id,
       ( SELECT LISTAGG( t.b_id, '|' ) WITHIN GROUP ( ORDER BY ROWNUM )
         FROM   test_data t
         WHERE  ',' || d.r_id || ',' LIKE '%,' || t.i_id || ',%'  ) AS r_id
FROM   test_data d;

You can improve performance of this option by using a function-based on ',' || r_id || ','',' || r_id || ',' ',' || r_id || ',' and on '%,' || i_id || ',%' '%,' || i_id || ',%' '%,' || i_id || ',%' .

(Same output as above)

In the solution below, I use a standard technique to split each comma-separated string into components (tokens) in factored subquery prep . Then I join back to the original table to replace each token (which is an i_id ) with the corresponding b_id , and then put the tokens back together into pipe-separated strings with listagg() .

Note : This solution assumes that each r_id has fewer than 100 tokens (see the "magic number" 100 in the definition of idx ). If it is known that each r_id will have no more than 9 tokens, then 100 can be changed to 10 (resulting in faster processing). If NO upper bound is known beforehand, you can change 100 to some ridiculously large number; 4000 will do if r_id is anything but a CLOB, as VARCHAR2 and such have a limit of 4000 characters.

Thanks to MT0 for reminding me to add this note.

with test_data ( b_id, i_id, r_id ) as (
       select 'W00001', 1234, '1235,1237'      from dual union all
       select 'B00001', 1235, '1236,1235'      from dual union all
       select 'T00001', 1236, '1235,1235,1235' from dual union all
       select 'X00001', 1237, '1234,1236,1238' from dual union all
       select 'M00001', 1238, '1238'           from dual
     ),
     idx ( n ) as (
       select level from dual connect by level < 100
     ),
     prep ( b_id, i_id, n, token ) as (
       select t.b_id, t.i_id, i.n,
                regexp_substr(t.r_id, '([^,]+)', 1, i.n, null, 1)
       from   test_data t join idx i
                          on i.n <= regexp_count(t.r_id, ',') + 1
     )
select p.b_id, p.i_id, 
       listagg(t.b_id, '|') within group (order by p.n) as r_id
from   prep p join test_data t
              on p.token = t.i_id
group by p.b_id, p.i_id
order by p.i_id;



B_ID         I_ID R_ID
------ ---------- ------------------------------
W00001       1234 B00001|X00001
B00001       1235 T00001|B00001
T00001       1236 B00001|B00001|B00001
X00001       1237 W00001|T00001|M00001
M00001       1238 M00001

Added information based on further conversation with MT0.

I edited this "Added information" again based on even more conversation with MT0. Thank you MT0 for keeping me on my toes!

In the solution below I do away with the magic number 100 and instead I use a common technique to deal with multiple input rows and connect by level from dual. I also show a common technique for dealing with duplicates (in the tokens obtained from the comma-separated input strings).

Query :

with 
     test_data ( b_id, i_id, r_id ) as (
           select 'W00001', 1234, '1235,1237'      from dual union all
           select 'B00001', 1235, '1236,1235'      from dual union all
           select 'T00001', 1236, '1235,1235,1235' from dual union all
           select 'X00001', 1237, '1234,1236,1238' from dual union all
           select 'M00001', 1238, '1238'           from dual
         ),
     prep ( b_id, i_id, n, token ) as (
       select b_id, i_id, level,
                regexp_substr(r_id, '([^,]+)', 1, level, null, 1)
       from   test_data t
       connect by level <= regexp_count(r_id, ',') + 1
              and prior r_id = r_id             -- to only generate the rows needed
              and prior sys_guid() is not null  -- this is unique, to avoid cycles
     ),
     z ( b_id, i_id, n, token, rn ) as (
       select b_id, i_id, n, token,
              row_number() over (partition by i_id, token order by n)
       from prep
     )
select z.b_id, z.i_id, 
       listagg(t.b_id, '|') within group (order by z.n) as r_id
from   z join test_data t
              on z.token = t.i_id
where  z.rn = 1
group by z.b_id, z.i_id
order by i_id;

Result :

B_ID         I_ID R_ID
------ ---------- ------------------------------
W00001       1234 B00001|X00001
B00001       1235 T00001|B00001
T00001       1236 B00001
X00001       1237 W00001|T00001|M00001
M00001       1238 M00001

5 rows selected.

Decided to add another answer since it uses an entirely different method - the recursive subquery factoring, available since Oracle version 11.2.

I did some testing, with an input (persistent) table called test_data with 9000 rows, each r_id being a comma-separated string of 200 tokens; structure very similar to the OP's small sample in the original post. I tried three methods: the hierarchical query (using connect by with the prior sys_guid() trick), which I proposed; the solution based on a correlated subquery and nested table, posted by MT0; and the recursive query I will show below. In each case, I used the query as the select... portion of a CTAS statement.

  • The hierarchical query took 39 minutes to complete.
  • The correlated subquery and nested table method took 18 minutes.
  • The recursive subquery factoring method took 13 minutes.

(To compare apples to apples I modified MT0's query, removing the extra information that the "tokens" in r_id are numbers - I treated them as strings, as do the other two methods.)

Recursive query:

with
     prep ( b_id, i_id, str, n, st_pos, end_pos, token) as (
       select  b_id, i_id, ',' || r_id || ',', -1, null, 1, null
         from  test_data
       union all
       select  b_id, i_id, str, n+1, end_pos + 1, instr(str, ',', 1, n+3),
               substr(str, st_pos, end_pos - st_pos)
         from  prep
         where end_pos != 0
     ),
     z ( b_id, i_id, n, token, rn ) as (
       select b_id, i_id, n, token,
              row_number() over (partition by i_id, token order by n)
       from prep
     )
select z.b_id, z.i_id, 
       listagg(t.b_id, '|') within group (order by z.n) as r_id
from   z join test_data t
              on z.token = t.i_id
where  z.rn = 1
group by z.b_id, z.i_id
;

Actually one can squeeze a little bit of extra performance; in the anchor part of the recursive CTE (the first member of the union all in the definition of prep ), I could start from n = 0 , st_pos = 1 and end_pos = the position of the first comma (actually the second in comma in the altered string; I find it a lot easier to add commas at the beginning and the end of the input CSV string and write the recursive CTE as I did.) However, this saves just one iteration out of 200 for each string; so a 0.5% of execution time could be saved. I find the way I wrote the recursive CTE easier to follow.

For completeness, here is the modified version of the "nested table" method I used (credit @MT0):

select b_id,
       i_id,
       ( select listagg(t.b_id, '|') within group (order by rownum)
         from   table ( 
                        cast ( 
                               multiset (
                                  select distinct regexp_substr(d.r_id, '[^,]+', 1, level) 
                                  from dual
                                  connect by level <= regexp_count(d.r_id, ',') + 1
                               ) 
                               as sys.odcivarchar2list
                        ) 
                ) v
                inner join test_data t
                on (v.column_value = t.i_id)
       )
from   test_data d;

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM