I want to use REGEXP_INSTR() within an Oracle database to check for lower/uppercase characters. I'm aware of the [:upper:] and [:lower:] POSIX character classes, but I went with [a-z] ranges, which give me really weird results that I don't understand. Can someone explain this?
SELECT REGEXP_INSTR('abc','[A-Z]',1,1,0,'c') FROM DUAL
-- Got 2, expected 0
SELECT REGEXP_INSTR('zyx','[A-Z]',1,1,0,'c') FROM DUAL
-- Got 1, expected 0
SELECT REGEXP_INSTR('ABC','[a-z]',1,1,0,'c') FROM DUAL
-- Got 1, expected 0
SELECT REGEXP_INSTR('ZYX','[a-z]',1,1,0,'c') FROM DUAL
-- Got 2, expected 0
SELECT REGEXP_INSTR('a3','[A-F0-9]',1,1,0,'c') FROM DUAL
-- Got 2, expected 2
SELECT REGEXP_INSTR('b3','[A-F0-9]',1,1,0,'c') FROM DUAL
-- Got 1, expected 2
SELECT REGEXP_INSTR('b3','[A-F0-9]') FROM DUAL
-- Got 1, expected 1 or 2
SELECT REGEXP_INSTR('a3','[A-F0-9]') FROM DUAL
-- Got 2, expected same as above
The reason for this behavior is the collation rules. See the NLS_SORT documentation:
- If the value is BINARY, then the collating sequence for ORDER BY queries is based on the numeric value of characters (a binary sort that requires less system overhead).
- If the value is a named linguistic sort, sorting is based on the order of the defined linguistic sort. Most (but not all) languages supported by the NLS_LANGUAGE parameter also support a linguistic sort with the same name.
Set NLS_SORT to BINARY so that the [A-Z] range is evaluated in the same order as the ASCII table:
alter session set nls_sort = 'BINARY'
Then, you will get consistent results.
See the online demo.
Okay, the answer that NLS_SORT causes this behavior is correct, but I don't think it explains it in an understandable way. None of the documentation I found actually does that...
You have to imagine that a character range such as [a-z] is actually derived from a single string of all possible characters, sorted according to NLS_SORT.
Let's assume the whole alphabet is just alphanumeric characters. Sorted by BINARY, this results in a base string like 0123456789ABCDE...XYZabcde...xyz (in ASCII, uppercase letters come before lowercase ones). Derived from this, [0-6] expands to [0123456], [a-f] to [abcdef], [5-b] to [56789ABC...XYZab], etc.
Sorting by a linguistic definition, however, results in a different base string, like 0123456789aAbBcCdDeEfF...xXyYzZ. Derived from this, [0-6] still expands to [0123456], but [a-f] now expands to [aAbBcCdDeEf] and [5-b] to [56789aAb], etc.
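The expansion model above can be sketched in a few lines of Python. This is only a simulation of the mental model, not Oracle's actual implementation: the helper expand() and the interleaved linguistic ordering are assumptions made for illustration, and the binary order is simply Python's code point order (digits, then uppercase, then lowercase).

```python
import string

def expand(range_expr, base):
    """Expand an 'x-y' range into the slice of `base` from x to y, inclusive."""
    start, end = range_expr.split("-")
    return base[base.index(start): base.index(end) + 1]

alnum = string.digits + string.ascii_letters

# BINARY-like order: sort by code point (digits, uppercase, lowercase in ASCII).
binary_base = "".join(sorted(alnum))

# Hypothetical linguistic order (an assumption): digits, then aAbBcC...zZ.
linguistic_base = string.digits + "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)

print(expand("a-f", binary_base))      # abcdef
print(expand("a-f", linguistic_base))  # aAbBcCdDeEf
print(expand("5-b", linguistic_base))  # 56789aAb
```

The same range expression yields different member sets purely because the underlying base string changes.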
This is why a did not match [A-Z], but b did: [A-Z] actually expands to [AbBcC...yYzZ], which includes z but not a.
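As a rough check, the "Got" values from the question can be reproduced with a small Python simulation. It assumes a hypothetical linguistic order aAbBcC...zZ, and first_match() is an illustrative stand-in for REGEXP_INSTR, not Oracle's real matching logic.

```python
import string

# Hypothetical linguistic order (an assumption for illustration): aAbBcC...zZ
BASE = "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)

def first_match(text, first, last):
    """Return the 1-based position of the first character of `text` that
    falls inside the range first-last of BASE, or 0 if none does."""
    members = BASE[BASE.index(first): BASE.index(last) + 1]
    for pos, ch in enumerate(text, start=1):
        if ch in members:
            return pos
    return 0

print(first_match("abc", "A", "Z"))  # 2, as in the question
print(first_match("zyx", "A", "Z"))  # 1
print(first_match("ABC", "a", "z"))  # 1
print(first_match("ZYX", "a", "z"))  # 2
```

Under this ordering, [A-Z] leaves out only a and [a-z] leaves out only Z, which matches the surprising positions reported above.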
In reality, the base string might contain even more characters, such as accented letters ([aAàáâÀÁÂ...]), so [A-Z] may match more than just the basic Latin letters.