Select rows where first, last or both characters are special or punctuation, unless they only have a period at the end

Question

I need to retrieve rows from my table which have names that start or end with a [:space:] or other special character [:punct:] , excluding a single dot ( . ) at the end of the name. The idea is to pull names that are possible inconsistencies.

Examples that must appear:

'GEORGE & SON ' - has an extra space at the end.
'-GEORGE & SON' - has an extra - at the start.
'&GEORGE & SON' - has an extra & at the start.
'-GEORGE & SON SA' - has an extra - at the start. The dot . at the end is not a problem.
'GEORGE & SON..' - has not one, but two dots at the end. That is an exception for the strings that ends with more than one . ; they are also bad names.

examples that must not appear:

'GEORGE & SON.' - Only has an extra '.', at the end.

I'm using the expression:

REGEXP_LIKE(col, '(^[[:punct:]]|[[:punct:]]$)|(^[[:space:]]|[[:space:]]$)')

But in spite of retrieving names that start or end with space or special character, is also pulling the ones that have a dot '.' as the last character.

How can I change this to get the result I need?

Answer 1

just add {2} after the second [[:punct:]] . that meand that the dot should be there at least 2 times

with tab as(
  select 'GEORGE & SON ' as s from dual union all
  select '-GEORGE & SON'  as s from dual union all
  select '&GEORGE & SON'  as s from dual union all
  select 'GEORGE & SON..'  as s from dual union all
  select 'GEORGE & SON.'  as s from dual union all
  select '-GEORGE & SON S.A.' as s from dual  
)
select * from  tab 
where REGEXP_LIKE(s, '(^[[:punct:]]|[[:punct:]]{2}$)|(^[[:space:]]|[[:space:]]$)')

Answer 2

As the pre-defined punctuation character class does not work for the end of the string, a self defined character class is used instead. Leaving out the dot on purpose. Adding in the single quote separately (as escaping it didn't work and finding the right character for the q operator might be hard to do in this case). Adding the closing square bracket on its own as Oracle doesn't seem to handle it correctly when escaped. Finally adding the trailing consecutive dots explicitly:

WITH T (id, col) AS (
  SELECT 1, 'GEORGE & SON ' FROM DUAL UNION ALL
  SELECT 2, '-GEORGE & SON'  FROM DUAL UNION ALL
  SELECT 3, '&GEORGE & SON'  FROM DUAL UNION ALL
  SELECT 4, 'GEORGE & SON..'  FROM DUAL UNION ALL
  SELECT 5, 'GEORGE & SON.'  FROM DUAL UNION ALL
  SELECT 6, '-GEORGE & SON S.A.' FROM DUAL UNION ALL
  SELECT 7, 'GEORGE & SON!' FROM DUAL UNION ALL
  SELECT 8, 'GEORGE & SON"' FROM DUAL UNION ALL
  SELECT 9, 'GEORGE & SON#' FROM DUAL UNION ALL
  SELECT 10, 'GEORGE & SON$' FROM DUAL UNION ALL
  SELECT 11, 'GEORGE & SON%' FROM DUAL UNION ALL
  SELECT 12, 'GEORGE & SON&' FROM DUAL UNION ALL
  SELECT 13, 'GEORGE & SON(' FROM DUAL UNION ALL
  SELECT 14, 'GEORGE & SON)' FROM DUAL UNION ALL
  SELECT 15, 'GEORGE & SON*' FROM DUAL UNION ALL
  SELECT 16, 'GEORGE & SON+' FROM DUAL UNION ALL
  SELECT 17, 'GEORGE & SON,' FROM DUAL UNION ALL
  SELECT 18, 'GEORGE & SON\' FROM DUAL UNION ALL
  SELECT 19, 'GEORGE & SON-' FROM DUAL UNION ALL
  SELECT 20, 'GEORGE & SON\' FROM DUAL UNION ALL
  SELECT 21, 'GEORGE & SON/' FROM DUAL UNION ALL
  SELECT 22, 'GEORGE & SON:' FROM DUAL UNION ALL
  SELECT 23, 'GEORGE & SON;' FROM DUAL UNION ALL
  SELECT 24, 'GEORGE & SON<' FROM DUAL UNION ALL
  SELECT 25, 'GEORGE & SON=' FROM DUAL UNION ALL
  SELECT 26, 'GEORGE & SON>' FROM DUAL UNION ALL
  SELECT 27, 'GEORGE & SON?' FROM DUAL UNION ALL
  SELECT 28, 'GEORGE & SON@' FROM DUAL UNION ALL
  SELECT 29, 'GEORGE & SON[' FROM DUAL UNION ALL
  SELECT 30, 'GEORGE & SON^' FROM DUAL UNION ALL
  SELECT 31, 'GEORGE & SON_' FROM DUAL UNION ALL
  SELECT 32, 'GEORGE & SON`' FROM DUAL UNION ALL
  SELECT 33, 'GEORGE & SON{' FROM DUAL UNION ALL
  SELECT 34, 'GEORGE & SON|' FROM DUAL UNION ALL
  SELECT 35, 'GEORGE & SON}' FROM DUAL UNION ALL
  SELECT 36, 'GEORGE & SON~' FROM DUAL UNION ALL
  SELECT 37, 'GEORGE & SON''' FROM DUAL UNION ALL
  SELECT 38, 'GEORGE & SON]' FROM DUAL)
SELECT
  * FROM T
 WHERE REGEXP_LIKE(col, '(^[[:punct:]]|[-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']$)|]$|\.\.$|(^[[:space:]]|[[:space:]]$)')
 ORDER BY id
;

Updated Requirements

Punctuation followed by a single dot

Add an optional dot to the set of special characters; going from

'[-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']$'

to

'[-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']\.?$'

as in

WITH T (id, col) AS (
  SELECT 40, 'GEORGE & SON^.'FROM DUAL UNION ALL
  SELECT 41, 'GEORGE & SON_.'FROM DUAL UNION ALL
  SELECT 42, 'GEORGE & SON`.'FROM DUAL UNION ALL
  SELECT 43, 'GEORGE & SON{.'FROM DUAL UNION ALL
  SELECT 44, 'GEORGE & SON|.'FROM DUAL UNION ALL
  SELECT 45, 'GEORGE & SON}.'FROM DUAL UNION ALL
  SELECT 46, 'GEORGE & SON~.'FROM DUAL UNION ALL
  SELECT 47, 'GEORGE & SON''.'FROM DUAL UNION ALL
  SELECT 48, 'GEORGE & SON].'FROM DUAL)
SELECT
  * FROM T
 WHERE REGEXP_LIKE(col, '([-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']\.?$)|]\.?$')
 ORDER BY id
;

Repetition of (combinations of) blanks and special characters within the string

Originally, just leading and trailing occurrences were asked for… ;-)

Sequences of two or more space/punctuation characters are caught by

[[:space:][:punct:]]{2,}

If you want this explicitely inside strings, only - surround them with word characters:

\w[[:space:][:punct:]]{2,}\w

Leading/trailing consecutive spaces are already matched when are single one is found - no need for worry about them explicitely.
Which gives:

WITH T (id, col) AS (
  SELECT 50, 'GEORGE & SON  ' FROM DUAL UNION ALL
  SELECT 51, 'GEORGE & SON   '  FROM DUAL UNION ALL
  SELECT 52, '  GEORGE & SON'  FROM DUAL UNION ALL
  SELECT 53, '    GEORGE & SON'  FROM DUAL UNION ALL
  SELECT 54, 'GEORGE &  SON'  FROM DUAL UNION ALL
  SELECT 55, 'GEORGE  & SON S.A.' FROM DUAL UNION ALL
  SELECT 56, 'GEORGE & SON    S.A.' FROM DUAL UNION ALL
  SELECT 60, '  GEORGE and SON'  FROM DUAL UNION ALL
  SELECT 61, ' ,GEORGE and SON' FROM DUAL UNION ALL
  SELECT 62, ', GEORGE and SON'  FROM DUAL UNION ALL
  SELECT 63, 'GEORGE -- SON' FROM DUAL UNION ALL
  SELECT 64, 'GEORGE --SON' FROM DUAL UNION ALL
  SELECT 65, 'GEORGE & SON' FROM DUAL UNION ALL
  SELECT 66, 'GEORGE + SON' FROM DUAL UNION ALL
  SELECT 67, 'GEORGE and  , SON' FROM DUAL UNION ALL
  SELECT 68, 'GEORGE and , SON' FROM DUAL UNION ALL
  SELECT 69, 'GEORGE and SON ,'  FROM DUAL UNION ALL
  SELECT 70, 'GEORGE and SON. '  FROM DUAL UNION ALL
  SELECT 71, 'GEORGE and+-SON'  FROM DUAL)
SELECT
  * FROM T
--  WHERE REGEXP_LIKE(col, '(^[[:punct:]]|[-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']\.?$)|]$|\.\.$|(^[[:space:]]|[[:space:]]$)|[[:space:][:punct:]]{2,}')
  WHERE REGEXP_LIKE(col, '(^[[:punct:]]|[-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']\.?$)|]$|\.\.$|(^[[:space:]]|[[:space:]]$)|\w[[:space:][:punct:]]{2,}\w')
  ORDER BY id
;

But this will produce false positives, most prominently GEORGE & SON . This could, to some degree, be avoided by replacing [:punct:] with a less inclusive set. The (final) choice will depend on whether false negatives or false positives are more of a concern.

See it in action:

original: SQL Fiddle
updated: SQL Fiddle

Catch arbitrary sequences of punctuation and space characters - but allow for single letters followed by a single dot and a single blank

As mentioned before, false positives need to balanced against false negatives. One way or the other. However, this is probably a good time to consider breaking apart the overall problem into smaller ones - and handle them individually. Even if GEORGE and P. SON is perfectly acceptable, you probably want to review, eg, -GEORGE and P. SON . So let's focus on sequences of stray characters in the middle of the string - even remembering the ** & ** from before, and allowing enumerations (and thus commas):

WHERE
  REGEXP_LIKE(col, '\w[[:space:][:punct:]]{2,}\w')
  AND
  NOT REGEXP_LIKE(col, ' [[:upper:]]\. \w')
  AND
  NOT INSTR(col, ', ') > 0
  AND
  NOT INSTR(col, ' & ') > 0

Probably followed by

  WHERE
  REGEXP_LIKE(col, '\w[[:space:][:punct:]]{2,}\w')
  AND
  (REGEXP_LIKE(col, ' [[:upper:]]\. \w')
   OR
   INSTR(col, ', ') > 0
   OR
   INSTR(col, ' & ') > 0
  )

in order to find, eg, GEORGE and , SON between many valid ones. INSTR could be faster than REGEX - depending on overall circumstances…

A few more words on the mechanics

(i) [[:punct:][:space:]] essentially combines [[:punct:]] and [[:space:]] into a single character class. As far as selection from that class is concerned, the order is of no relevance.

(ii)

[-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']

is

[-!"#$%&()*+,\/:;<=>?@[^_`{|}~]

with a single quote added. If this were tried directly, Oracle would consider the single quote to end the parameter value. And escaping the single quote with a backslash didn't work… So basically, this is what has been called above "Adding in the single quote separately".

Please comment, if and as this requires adjustment / further detail.

Select rows where first, last or both characters are special or punctuation, unless they only have a period at the end

Question

2 answers

solution1
0 2019-04-04 11:00:11

solution2
0 ACCPTED 2019-04-04 14:39:10

Updated Requirements

Punctuation followed by a single dot

Repetition of (combinations of) blanks and special characters within the string

Catch arbitrary sequences of punctuation and space characters - but allow for single letters followed by a single dot and a single blank

A few more words on the mechanics

Select rows where first, last or both characters are special or punctuation, unless they only have a period at the end

Question

2 answers

solution1 0 2019-04-04 11:00:11

solution2 0 ACCPTED 2019-04-04 14:39:10

Updated Requirements

Punctuation followed by a single dot

Repetition of (combinations of) blanks and special characters within the string

Catch arbitrary sequences of punctuation and space characters - but allow for single letters followed by a single dot and a single blank

A few more words on the mechanics

solution1
0 2019-04-04 11:00:11

solution2
0 ACCPTED 2019-04-04 14:39:10