
Oracle REGEXP_INSTR() and “a-z” character range doesn't match as expected

I want to use REGEXP_INSTR() in an Oracle database to check for lowercase/uppercase characters. I'm aware of the [:upper:] and [:lower:] POSIX character classes, but I went with a-z ranges, which give me really weird results that I don't understand. Can someone explain this?

SELECT REGEXP_INSTR('abc','[A-Z]',1,1,0,'c') FROM DUAL
-- Got 2, expected 0

SELECT REGEXP_INSTR('zyx','[A-Z]',1,1,0,'c') FROM DUAL
-- Got 1, expected 0

SELECT REGEXP_INSTR('ABC','[a-z]',1,1,0,'c') FROM DUAL
-- Got 1, expected 0

SELECT REGEXP_INSTR('ZYX','[a-z]',1,1,0,'c') FROM DUAL
-- Got 2, expected 0

SELECT REGEXP_INSTR('a3','[A-F0-9]',1,1,0,'c') FROM DUAL
-- Got 2, expected 2

SELECT REGEXP_INSTR('b3','[A-F0-9]',1,1,0,'c') FROM DUAL
-- Got 1, expected 2

SELECT REGEXP_INSTR('b3','[A-F0-9]') FROM DUAL
-- Got 1, expected 1 or 2

SELECT REGEXP_INSTR('a3','[A-F0-9]') FROM DUAL
-- Got 2, expected same as above

The reason for this behavior is the collation rules. See the NLS_SORT documentation:

  • If the value is BINARY, then the collating sequence for ORDER BY queries is based on the numeric value of characters (a binary sort that requires less system overhead).
  • If the value is a named linguistic sort, sorting is based on the order of the defined linguistic sort. Most (but not all) languages supported by the NLS_LANGUAGE parameter also support a linguistic sort with the same name.

Set NLS_SORT to BINARY so that [A-Z] is interpreted in the same order as the ASCII table:

alter session set nls_sort = 'BINARY'

Then, you will get consistent results.

See the online demo.

Okay, the answer that NLS_SORT causes this behavior is correct, but I don't think it explains it in an understandable way. None of the documentation I found actually does that...

You have to imagine that a character range such as [a-z] is taken as a substring of one single string containing all possible characters, sorted according to NLS_SORT.

Let's assume the whole alphabet consists only of alphanumeric characters. Sorted by BINARY, this results in a base string like 0123456789ABCDE...XYZabcdefgh...xyz (in ASCII, uppercase letters have lower code points than lowercase ones). Derived from this, [0-6] expands to [0123456], [a-f] to [abcdef], [5-b] to [56789ABC...XYZab], etc.

Sorting by a linguistic definition, however, results in a different base string, like 0123456789aAbBcCdDeEfF...xXyYzZ. Derived from this, [0-6] still expands to [0123456], but [a-f] now expands to [aAbBcCdDeEf] and [5-b] to [56789aAb], etc.

This is why a did not match [A-Z], but b did: [A-Z] actually expands to [AbBcC...yYzZ], which includes z but not a.

In reality, [A-Z] might contain even more characters, because the real base string also includes accented letters, e.g. aAàáâÀÁÂ... and so on.
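The expansion model described above can be sketched in a few lines of Python. This is only a toy simulation, not how Oracle is implemented: the alphabet is limited to ASCII digits and letters, the "linguistic" ordering is an assumed a, A, b, B, ... sequence, and the names expand_range and toy_regexp_instr are hypothetical helpers invented for this illustration.

```python
import string

# Toy alphabet: ASCII digits and letters only (an assumption for illustration).
ALPHABET = string.digits + string.ascii_letters

# Code-point order, standing in for NLS_SORT = BINARY.
BINARY = "".join(sorted(ALPHABET))

# Assumed linguistic order: each lowercase letter is directly followed by
# its uppercase form (0..9, a, A, b, B, ..., z, Z).
LINGUISTIC = "".join(sorted(ALPHABET, key=lambda c: (c.lower(), c.isupper())))


def expand_range(start, end, base):
    """Return every character that the range [start-end] covers in `base`."""
    lo, hi = base.index(start), base.index(end)
    return base[lo:hi + 1]


def toy_regexp_instr(text, ranges, base):
    """Return the 1-based position of the first character of `text` that
    falls inside any of `ranges`, or 0 if there is none."""
    allowed = "".join(expand_range(s, e, base) for s, e in ranges)
    for pos, ch in enumerate(text, start=1):
        if ch in allowed:
            return pos
    return 0


if __name__ == "__main__":
    print(expand_range("a", "f", LINGUISTIC))                   # aAbBcCdDeEf
    print(toy_regexp_instr("abc", [("A", "Z")], LINGUISTIC))    # 2
    print(toy_regexp_instr("abc", [("A", "Z")], BINARY))        # 0
```

Under the assumed linguistic ordering, the sketch reproduces the positions reported in the question (2 for 'abc' against [A-Z], 1 for 'b3' against [A-F0-9]), while the binary ordering gives the results that were originally expected.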
