What does this regular expression mean in Oracle?

SELECT  REGEXP_REPLACE(LISTAGG(A.ID, ',') WITHIN GROUP (ORDER BY A.ID), '([^,]+)(,\1)+', '\1')
FROM    TABLE A

I don't know what "\1" means in the above SQL. After creating a list by separating "A.ID" with commas through "LISTAGG", the purpose seems to be to remove the duplicate "A.ID", but I want to know the exact meaning.

For reference, "A.ID" is a NUMBER(4) column (e.g. 1111, 2222, ...).

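\1 is a backreference: it matches the same text that the first capturing group matched. The expression as a whole is an attempt to replace consecutive duplicates with a single copy.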

REGEXP_REPLACE(
  comma_separated_list,
  '([^,]+)(,\1)+',
  '\1'
)

Then:

  • ([^,]+) will match one-or-more non-comma characters and store the value in a capturing group.
  • ,\1 will match a comma and then the value from that first capturing group.
  • (,\1)+ matches a comma and then the value from that first capturing group and matches it all one-or-more times.
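
As a quick check of the intended behavior, the pattern can be run against a literal list. This is only a minimal sketch: the sample string is made up for illustration and does not come from the question's table. Every element in it has the same length, so no element is a prefix of its neighbor and the duplicates collapse as intended:

```sql
-- Hypothetical sample list; every id is exactly four digits long.
SELECT REGEXP_REPLACE(
         '1111,1111,2222,2222,2222,3333',
         '([^,]+)(,\1)+',  -- an element, then one or more ",<same element>"
         '\1'              -- keep a single copy of the captured element
       ) AS deduplicated
FROM   DUAL;
-- deduplicated: 1111,2222,3333
```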

However, it does not work reliably. Suppose you have the list 12,123,123,123,123,1234,4444. Then:

  • the first match will be:

     12,123,123,123,123,1234,4444
     ^^^^^

    at the start and replace it with just 12 giving:

     123,123,123,123,1234,4444
     ^^

    It has already gone wrong as the match did not match a complete element and you have lost the first id value.

  • the second match starts after the end of the first match, skips the characters 3, and matches:

     123,123,123,123,1234,4444
         ^^^^^^^^^^^^^^^

    and replace it with 123 giving:

     123,1234,4444
         ^^^

    Again, the match was wrong as it did not match a complete element and it is only coincidence that the value output is correct.

  • and the final match will be:

     123,1234,4444
            ^^^

    replacing it with just 4 and giving the output:

     123,1234444

    This is very wrong, as you are now missing the id 12 and have an incorrect id of 1234444.
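
The failure can be reproduced without a table. This is a sketch that passes the example list as a literal in place of the LISTAGG output:

```sql
-- The literal stands in for LISTAGG(A.ID, ',') WITHIN GROUP (ORDER BY A.ID).
SELECT REGEXP_REPLACE(
         '12,123,123,123,123,1234,4444',
         '([^,]+)(,\1)+',
         '\1'
       ) AS broken_output
FROM   DUAL;
-- broken_output, following the three matches traced above: 123,1234444
```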


What you should probably be doing is filtering the duplicates before aggregating.

In newer Oracle versions (LISTAGG accepts DISTINCT from 19c) it is simply:

SELECT  LISTAGG(DISTINCT ID, ',') WITHIN GROUP (ORDER BY ID)
FROM    TABLE_NAME

or in older versions:

SELECT  LISTAGG(ID, ',') WITHIN GROUP (ORDER BY ID)
FROM    (
  SELECT DISTINCT id FROM TABLE_NAME
)
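
To try the subquery form without a real table, here is a sketch with an inline data set. The name sample_ids and its values are hypothetical stand-ins, chosen to reproduce the problem list used earlier:

```sql
-- sample_ids is a made-up inline data set, not a table from the question.
WITH sample_ids AS (
  SELECT 12   AS id FROM DUAL UNION ALL
  SELECT 123  AS id FROM DUAL UNION ALL
  SELECT 123  AS id FROM DUAL UNION ALL
  SELECT 1234 AS id FROM DUAL UNION ALL
  SELECT 4444 AS id FROM DUAL
)
SELECT LISTAGG(id, ',') WITHIN GROUP (ORDER BY id) AS no_duplicates
FROM   (
  SELECT DISTINCT id FROM sample_ids  -- duplicates are removed before aggregation
);
-- no_duplicates: 12,123,1234,4444
```

Because de-duplication happens on the numeric values rather than on text, ids that share a prefix (12, 123, 1234) cannot interfere with each other.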

If you did want to use regular expressions (which will be less efficient than using DISTINCT) then you can double up the delimiters and ensure you always match complete elements using (,[^,]+,)\1+ but then you also need to remove the repeated delimiters after de-duplication (which makes an inefficient solution even more inefficient):

SELECT  TRIM(
          BOTH ',' FROM
          REPLACE(
            REGEXP_REPLACE(
              LISTAGG(',' || ID || ',') WITHIN GROUP (ORDER BY ID),
              '(,[^,]+,)\1+',
              '\1'
            ),
            ',,',
            ','
          )
        ) AS no_duplicates
FROM    TABLE_NAME
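
To see what each step contributes, the intermediate values can be selected side by side. This is a sketch under stated assumptions: the literal is what LISTAGG(',' || ID || ',') would produce for the ids 12, 123, 123, 123, 123, 1234, 4444, and the names padded, deduplicated and csv are hypothetical:

```sql
WITH padded AS (
  -- Every element carries its own leading and trailing comma.
  SELECT ',12,,123,,123,,123,,123,,1234,,4444,' AS csv FROM DUAL
),
deduplicated AS (
  -- The pattern can now only match whole elements, commas included.
  SELECT REGEXP_REPLACE(csv, '(,[^,]+,)\1+', '\1') AS csv FROM padded
)
SELECT csv                                         AS after_regexp,   -- ,12,,123,,1234,,4444,
       REPLACE(csv, ',,', ',')                     AS after_replace,  -- ,12,123,1234,4444,
       TRIM(BOTH ',' FROM REPLACE(csv, ',,', ',')) AS no_duplicates   -- 12,123,1234,4444
FROM   deduplicated;
```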

