Oracle Regexp range doesn't work properly

Question

I am working with one of the example schemas in Oracle (Order Entry), using the product_information table.

I noticed that range expressions in my regexp functions don't work as expected.

Does anyone have an idea why the results of the code below also include lowercase letters? The output is shown in the picture. Is this an encoding problem? Normally, AZ are before az, and here it seems that the encoding has mixed capital and small letters...

select substr(product_description, 1, 25),
    regexp_substr(product_description, '[A-M]+') as reg0,
    regexp_substr(product_description, '[A-M]+', 1, 1,'i') as reg1,
    regexp_substr(product_description, '[A-M]+', 1, 1,'c')  as reg2
from product_information;

Answer 1

The documentation says Oracle "interprets range expressions as specified by the NLS_SORT parameter to determine the collation elements covered by a given range".
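To see which collation your session is actually using, you can query the session NLS settings. This is a minimal sketch; the choice of parameters to list is mine, and the values you get depend on your client and database configuration.

```sql
-- Show the collation-related settings in effect for the current session.
-- NLS_SORT is the one that regexp range expressions depend on.
select parameter, value
from nls_session_parameters
where parameter in ('NLS_SORT', 'NLS_COMP', 'NLS_LANGUAGE');
```

If NLS_SORT shows anything other than BINARY, ranges such as [A-M] are evaluated against that linguistic collation.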

With the default BINARY collation and some (simplified) sample data, your query gives:

SUBSTR(PRODUCT_DESCRIPTION,1,25)	REG0	REG1	REG2
Liquid...	L	Li	L
CRT...	C	C	C
Monitor...	M	M	M
10 inch...	null	i	null

With NLS_SORT changed to, for example, POLISH (guessing from your profile), the same query produces what you are seeing:

SUBSTR(PRODUCT_DESCRIPTION,1,25)	REG0	REG1	REG2
Liquid...	Li	Li	Li
CRT...	C	C	C
Monitor...	M	M	M
10 inch...	i	i	i

As @LukStorms pointed out, you can override the session setting by specifying collate binary in the function call, at least in recent versions (12.2+ I believe):

select substr(product_description, 1, 25),
    regexp_substr(product_description collate binary, '[A-M]+') as reg0,
    regexp_substr(product_description collate binary, '[A-M]+', 1, 1,'i') as reg1,
    regexp_substr(product_description collate binary, '[A-M]+', 1, 1,'c')  as reg2
from product_information;

db<>fiddle demo, including the session setting override.
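If you would rather change the setting once than add a collate clause to every call, a session-level override along these lines should work. It is a sketch using a literal taken from the sample data above, not the exact script from the demo.

```sql
-- Switch the current session to binary collation.
alter session set nls_sort = 'BINARY';

-- With binary collation, [A-M] covers only the uppercase letters A to M,
-- so this should find no match, like the "10 inch..." row in the first result set.
select regexp_substr('10 inch', '[A-M]+') as reg0
from dual;
```

The setting lasts only for the current session, so other sessions and their linguistic sorting are not affected.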

Refer to earlier sections of the same documentation for an explanation of collation and the difference between binary and linguistic collation. Your statement that "Normally, AZ are before az" reflects how binary collation works, at least for those ASCII ranges:

One way to sort character data is based on the numeric values of the characters defined by the character encoding scheme. This is called a binary collation. Binary collation is the fastest type of sort. It produces reasonable results for the English alphabet because the ASCII and EBCDIC standards define the letters A to Z in ascending numeric value.

With linguistic collation it's more complicated than that: as the second result set shows, a range such as [A-M] can then cover lowercase letters as well.
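One way to see the difference for yourself is to rank a few characters under both collations with NLSSORT. This is only an illustrative sketch: the sample characters and the column aliases are my own, and the positions you get depend on the collation definitions in your database version.

```sql
-- Rank a handful of characters under binary and under Polish collation.
-- Characters that fall between 'A' and 'M' in a given ordering are the ones
-- a range like [A-M] covers under that collation.
select ch,
       row_number() over (order by nlssort(ch, 'NLS_SORT = BINARY')) as binary_pos,
       row_number() over (order by nlssort(ch, 'NLS_SORT = POLISH')) as polish_pos
from (
    select 'A' as ch from dual union all
    select 'M' from dual union all
    select 'Z' from dual union all
    select 'a' from dual union all
    select 'i' from dual union all
    select 'q' from dual
)
order by polish_pos;
```

Comparing the two position columns shows where lowercase letters land relative to 'A' and 'M' in each ordering.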

1 answer

solution1
4 ACCEPTED 2022-01-12 12:51:54
