Oracle REGEXP_INSTR() and “a-z” character range doesn't match as expected

Question

I want to use REGEXP_INSTR() in an Oracle database to check for lowercase/uppercase characters. I'm aware of the [:upper:] and [:lower:] POSIX character classes, but I went with a-z, which gives me really weird results that I don't understand. Can someone explain this?

SELECT REGEXP_INSTR('abc','[A-Z]',1,1,0,'c') FROM DUAL
-- Got 2, expected 0

SELECT REGEXP_INSTR('zyx','[A-Z]',1,1,0,'c') FROM DUAL
-- Got 1, expected 0

SELECT REGEXP_INSTR('ABC','[a-z]',1,1,0,'c') FROM DUAL
-- Got 1, expected 0

SELECT REGEXP_INSTR('ZYX','[a-z]',1,1,0,'c') FROM DUAL
-- Got 2, expected 0

SELECT REGEXP_INSTR('a3','[A-F0-9]',1,1,0,'c') FROM DUAL
-- Got 2, expected 2

SELECT REGEXP_INSTR('b3','[A-F0-9]',1,1,0,'c') FROM DUAL
-- Got 1, expected 2

SELECT REGEXP_INSTR('b3','[A-F0-9]') FROM DUAL
-- Got 1, expected 1 or 2

SELECT REGEXP_INSTR('a3','[A-F0-9]') FROM DUAL
-- Got 2, expected same as above

Answer 1

The behavior is caused by the collation rules. See the NLS_SORT documentation:

If the value is BINARY, then the collating sequence for ORDER BY queries is based on the numeric value of characters (a binary sort that requires less system overhead).

If the value is a named linguistic sort, sorting is based on the order of the defined linguistic sort. Most (but not all) languages supported by the NLS_LANGUAGE parameter also support a linguistic sort with the same name.

Set NLS_SORT to BINARY so that [A-Z] is interpreted in the same order as the ASCII table:

alter session set nls_sort = 'BINARY'

Then, you will get consistent results.

See the online demo.

Answer 2

Okay, the answer that NLS_SORT causes this behavior is correct, but I don't think it explains it in an understandable way. None of the documentation I found actually does that...

You have to imagine that character ranges such as [a-z] are actually substrings of a single string that contains all possible characters, sorted according to NLS_SORT.

Let's assume the whole alphabet is just alphanumeric characters. Sorted by BINARY (that is, by character code, where uppercase letters come before lowercase ones), this results in a base string like 0123456789ABCDE...XYZabcdefgh...xyz. Derived from this, [0-6] expands to [0123456], [a-f] to [abcdef], [5-b] to [56789ABC...XYZab], etc.

Sorting by a linguistic_definition, however, results in a different base string, like 0123456789aAbBcCdDeEfF...xXyYzZ. Derived from this, [0-6] still expands to [0123456], but [a-f] now expands to [aAbBcCdDeEf] and [5-b] to [56789aAb], etc.

This is why a did not match [A-Z], but b did. [A-Z] actually expands to [AbBcC...yYzZ], which includes z but not a.

In reality, [A-Z] might contain even more characters, because the sorted base string can also include accented letters, like aAàáâÀÁÂ... .
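To make the expansion idea concrete, here is a minimal Python sketch of the mental model. It is not Oracle's implementation: the two base strings are simplified assumptions limited to ASCII alphanumerics, and expand_range and first_match are hypothetical helper names made up for this illustration. With the simplified linguistic order it gives the same positions as the first four queries in the question.

```python
import string

# Assumed base strings, restricted to ASCII alphanumerics for illustration.
# BINARY: characters in code point order (digits, then uppercase, then lowercase).
BINARY_ORDER = string.digits + string.ascii_uppercase + string.ascii_lowercase

# Simplified linguistic order: each lowercase letter is immediately
# followed by its uppercase form (aAbBcC...zZ).
LINGUISTIC_ORDER = string.digits + "".join(
    lower + upper
    for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase)
)


def expand_range(start, end, order):
    """Return every character from start to end (inclusive) as sorted in `order`."""
    i, j = order.index(start), order.index(end)
    return order[i:j + 1]


def first_match(text, start, end, order):
    """Return the 1-based position of the first character of `text` that
    falls inside the range start-end, or 0 if there is none."""
    chars = expand_range(start, end, order)
    for pos, ch in enumerate(text, start=1):
        if ch in chars:
            return pos
    return 0


print(expand_range("a", "f", LINGUISTIC_ORDER))         # aAbBcCdDeEf
print(first_match("abc", "A", "Z", LINGUISTIC_ORDER))   # 2
print(first_match("zyx", "A", "Z", LINGUISTIC_ORDER))   # 1
print(first_match("abc", "A", "Z", BINARY_ORDER))       # 0
```

In the linguistic order the range A-Z starts after a and ends after z, so it covers every letter except a. In the code point order it covers only the uppercase letters.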


solution1
1 2019-09-18 07:41:14

solution2
-1 2019-09-18 17:01:38
