简体   繁体   中英

String Comparison differences between .NET and T-SQL?

In a test case I've written, the string comparison doesn't appear to work the same way between SQL server / .NET CLR.

This C# code:

string lesser =  "SR2-A1-10-90";
string greater = "SR2-A1-100-10";

Debug.WriteLine(string.Compare("A","B"));
Debug.WriteLine(string.Compare(lesser, greater));

Will output:

-1
1

This SQL Server code:

declare @lesser varchar(20);
declare @greater varchar(20);

set @lesser =  'SR2-A1-10-90';
set @greater = 'SR2-A1-100-10';

IF @lesser < @greater
    SELECT 'Less Than';
ELSE
    SELECT 'Greater than';

Will output:

Less Than

Why the difference?

This is documented here .

Windows collations (eg Latin1_General_CI_AS ) use Unicode type collation rules. SQL Collations don't.

This causes the hyphen character to be treated differently between the two.

Further to gbn's answer, you can make them behave the same by using CompareOptions.StringSort in C# (or by using StringComparison.Ordinal). This treats symbols as occurring before alphanumeric symbols, so "-" < "0".

However, Unicode vs ASCII doesn't explain anything, as the hex codes for the ASCII codepage are translated verbatim to the Unicode codepage: "-" is 002D (45) while "0" is 0030 (48).

What is happening is that .NET is using "linguistic" sorting by default, which is based on a non-ordinal ordering and weight applied to various symbols by the specified or current culture. This linguistic algorithm allows, for instance, "résumé" (spelled with accents) to appear immediately following "resume" (spelled without accents) in a sorted list of words, as "é" is given a fractional order just after "e" and well before "f". It also allows "cooperation" and "co-operation" to be placed closely together, as the dash symbol is given low "weight"; it matters only as the absolute final tiebreakers when sorting words like "bits", "bit's", and "bit-shift" (which would appear in that order).

So-called ordinal sorting (strictly according to Unicode values, with or without case insensitivity) will produce very different and sometimes illogical results, as variants of letters usually appear well after the basic undecorated Latin alphabet in ASCII/Unicode ordinals, while symbols occur before it. For instance, "é" comes after "z" and so the words "resume", "rosin", "ruble", "résumé" would be sorted in that order. "Bit's", "Bit-shift", "Biter", "Bits" would be sorted in that order as the apostrophe comes first, followed by the dash, then the letter "e", then the letter "s". Neither of these seem logical from a "natural language" perspective.

  • In SQL you used varchar which is basically ASCII (subject to collation) which will give - before 0
  • In C# all strings are Unicode

The finer points of UTF-xx (c#) vs UCS-2 (SQL Server) are quite tricky.

Edit:

I posted too soon

I get "Greater Than" on SQL Server 2008 with collation Latin1_General_CI_AS

Edit 2:

I'd also try SELECT ASCII(...) on your dash. For example, if the SQL snippet has ever been in a Word document the - (150) is not the - (45) I copied into SQL Server for testing out of my browser from your questions. See CP 1252 (= CP1 = SQL Server lingo)

Edit 3: See Martin Smith's answer: the 2 collations have different sort orders.

Several great answers already on why this happens, but I'm sure others just want to know the C# code to iterate the collection in the same order as SQL server. I have found the following works best. "Ordinal" gets around the hyphen issue while "IgnoreCase" seems to reflect the SQL server default as well.

Debug.WriteLine(string.Compare(lesser, greater, StringComparison.OrdinalIgnoreCase));

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM