What does this regular expression mean in Oracle?

Question

SELECT  REGEXP_REPLACE(LISTAGG(A.ID, ',') WITHIN GROUP (ORDER BY A.ID), '([^,]+)(,\1)+', '\1')
FROM    TABLE A

I don't know what "\1" means in the above SQL. After creating a list by separating "A.ID" with commas through "LISTAGG", the purpose seems to be to remove the duplicate "A.ID", but I want to know the exact meaning.

For reference, "A.ID" is a NUMBER(4) column type. (eg 1111, 2222...)

Answer 1

It means that you are attempting to replace duplicates.

REGEXP_REPLACE(
  comma_separated_list,
  '([^,]+)(,\1)+',
  '\1'
)

Then:

([^,]+) will match one-or-more non-comma characters and store the value in a capturing group.
,\1 will match a comma and then the value from that first capturing group.
(,\1)+ matches a comma and then the value from that first capturing group and matches it all one-or-more times.

However, it does not work reliably. If you have the list 12,123,123,123,123,1234,4444

Then:

the first match will be:
```
 12,123,123,123,123,1234,4444 ^^^^^
```
at the start and replace it with just 12 giving:
```
 123,123,123,123,1234,4444 ^^
```
It has already gone wrong as the match did not match a complete element and you have lost the first id value.
the second match will start after the first match will skip the 3, characters and match:
```
 123,123,123,123,1234,4444 ^^^^^^^^^^^^^^^
```
and replace it with 123 giving:
```
 123,1234,4444 ^^^
```
Again, the match was wrong as it did not match a complete element and it is only coincidence that the value output is correct.
and the final match will be:
```
 123,1234,4444 ^^^
```
replacing it with just 4 and giving the output:
```
 123,1234444
```
Which is very wrong as you are now missing the id 12 and have an incorrect id of 1234444 .

What you should probably be doing is filtering the duplicates before aggregating.

In newer Oracle versions it is simply:

SELECT  LISTAGG(DISTINCT ID, ',') WITHIN GROUP (ORDER BY ID)
FROM    TABLE

or in older versions:

SELECT  LISTAGG(ID, ',') WITHIN GROUP (ORDER BY ID)
FROM    (
  SELECT DISTINCT id FROM TABLE
)

If you did want to use regular expressions (which will be a more inefficient than using DISTINCT ) then you can double up the delimiters and ensure you always match complete elements using (,[^,]+,)\1+ but then you also need to remove the repeated delimiters after de-duplication (which makes an inefficient solution even more inefficient):

SELECT  TRIM(
          BOTH ',' FROM
          REPLACE(
            REGEXP_REPLACE(
              LISTAGG(',' || ID || ',') WITHIN GROUP (ORDER BY ID),
              '(,[^,]+,)\1+',
              '\1'
            ),
            ',,',
            ','
          )
        ) AS no_duplicates
FROM    TABLE_NAME

db<>fiddle here

What does this regular expression mean in Oracle?

Question

1 answers

solution1
1 ACCPTED 2022-06-23 10:41:22

What does this regular expression mean in Oracle?

Question

1 answers

solution1 1 ACCPTED 2022-06-23 10:41:22

solution1
1 ACCPTED 2022-06-23 10:41:22