How to map SQL collation setting to a Java comparator?
Is there a way to translate a database's collation setting (e.g. SQL_Latin1_General_CP1_CI_AS) to a Java Comparator implementation, so that I can apply the same ordering in Java code as the database does?
Is there an existing library that already provides this mapping?
Simplistically, you can use the COLLATIONPROPERTY function, which gives you:
The ComparisonStyle is a bit-masked field that is encoded as follows:
Unfortunately, a collation that is sensitive to everything (e.g. Latin1_General_CS_AS_KS_WS) equates to 0. This is a problem because both the _BIN and _BIN2 collations also equate to 0. Hence you still need to check whether the name ends in _BIN% to get the full picture.
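As a rough illustration, the bit mask could be decoded in Java along these lines. The flag values used here (1 = ignore case, 2 = ignore accent, 65536 = ignore kana, 131072 = ignore width) come from Microsoft's COLLATIONPROPERTY documentation rather than from this answer, and the class and field names are made up for the sketch.

```java
import java.util.Locale;

/** Hypothetical helper: interprets COLLATIONPROPERTY(name, 'ComparisonStyle'). */
final class ComparisonStyle {
    // Flag values as documented for COLLATIONPROPERTY (assumed stable across versions).
    static final int IGNORE_CASE = 1;
    static final int IGNORE_ACCENT = 2;
    static final int IGNORE_KANA = 65536;
    static final int IGNORE_WIDTH = 131072;

    final boolean binary;
    final boolean caseInsensitive;
    final boolean accentInsensitive;
    final boolean kanaInsensitive;
    final boolean widthInsensitive;

    ComparisonStyle(String collationName, int style) {
        // A style of 0 is ambiguous, so the name has to settle whether this is a binary collation.
        String upper = collationName.toUpperCase(Locale.ROOT);
        this.binary = upper.endsWith("_BIN") || upper.endsWith("_BIN2");
        // The sensitivity flags are only meaningful for non-binary collations.
        this.caseInsensitive = !binary && (style & IGNORE_CASE) != 0;
        this.accentInsensitive = !binary && (style & IGNORE_ACCENT) != 0;
        this.kanaInsensitive = !binary && (style & IGNORE_KANA) != 0;
        this.widthInsensitive = !binary && (style & IGNORE_WIDTH) != 0;
    }
}
```

With these values, a style of 196609 (1 + 65536 + 131072) decodes to case-, kana- and width-insensitive but accent-sensitive, while a style of 0 is told apart from a binary collation only by the name check.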
But this is not so simple. There are two main types of collations: SQL Server collations and Windows collations.
The SQL Server collations (i.e. those starting with SQL_) are deprecated and should not be used anymore, though a lot of systems still default to SQL_Latin1_General_CP1_CI_AS.
For both types of collations, NCHAR / NVARCHAR / XML data uses the Unicode sorting algorithms. For non-Unicode data, the Windows collations should sort the same between SQL Server and .NET. However, for the SQL Server collations, the sorting algorithm does not necessarily match the Windows collation (or possibly anything). But they do have their own Sort Order IDs and there might be public documentation describing those rules.
The Windows collations have several variations:
Differing versions: unspecified should be the original set, then the first set of updates is labeled _90 and the newest updates are the _100 series.
Differing binary ordering: the older _BIN collations do not map to anything exactly in .NET since they compare the first character as a character. The newer _BIN2 collations are pure code-point comparisons and ordering and should map to the ordinal ComparisonStyle.
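The naming conventions above can be checked mechanically. The following sketch, with made-up class and method names, picks out the SQL_ prefix, the version suffix and the binary variant from a collation name. It returns a plain String.compareTo ordering for _BIN2 only, on the assumption that .NET's ordinal comparison and Java's compareTo both compare UTF-16 code units.

```java
import java.util.Comparator;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the variations encoded in a collation name. */
final class CollationName {
    enum BinaryKind { NONE, BIN, BIN2 }

    // A version is an all-digit name segment such as "_90" or "_100".
    private static final Pattern VERSION = Pattern.compile("_(\\d+)(?=_|$)");

    final boolean sqlServerCollation;
    final int version; // 0 when the name carries no version segment
    final BinaryKind binaryKind;

    CollationName(String name) {
        this.sqlServerCollation = name.startsWith("SQL_");
        Matcher matcher = VERSION.matcher(name);
        this.version = matcher.find() ? Integer.parseInt(matcher.group(1)) : 0;
        if (name.endsWith("_BIN2"))
            this.binaryKind = BinaryKind.BIN2;
        else if (name.endsWith("_BIN"))
            this.binaryKind = BinaryKind.BIN;
        else
            this.binaryKind = BinaryKind.NONE;
    }

    /** Returns an ordinal comparator for _BIN2; empty for everything else. */
    Optional<Comparator<String>> ordinalComparator() {
        if (binaryKind == BinaryKind.BIN2)
            return Optional.of(Comparator.<String>naturalOrder());
        return Optional.empty();
    }
}
```

For the older _BIN collations and for all non-binary collations the method deliberately returns nothing, since no exact equivalent is claimed for them.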
Beyond the specifics of any particular collation, there is another factor complicating what you are trying to accomplish: the default collation for a database does not necessarily determine the collation used for sorting / comparing a particular predicate or field! The collation can be taken from the field being operated on, it can be taken from the database default for string literals and variables, or it can be overridden in both cases via the COLLATE clause. Please see the MSDN page for Collation Precedence for more details.
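One way to reduce this ambiguity is to pin the collation explicitly in the query with COLLATE, so that the Java side knows which rules were applied. The sketch below builds such an ORDER BY. The helper is hypothetical, and the table and column names in the usage note are placeholders.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: builds a query whose sort collation is stated explicitly. */
final class CollatedQuery {
    // Only plain identifiers are accepted, because the parts are concatenated into SQL.
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    static String orderBy(String table, String column, String collation) {
        for (String part : new String[] {table, column, collation}) {
            if (!IDENTIFIER.matcher(part).matches())
                throw new IllegalArgumentException("Not a plain identifier: " + part);
        }
        return "SELECT " + column + " FROM " + table
            + " ORDER BY " + column + " COLLATE " + collation;
    }
}
```

For example, CollatedQuery.orderBy("people", "name", "Latin1_General_100_CI_AS") yields a statement that sorts by that collation regardless of the column or database default.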
In the end, there is no deterministic means of getting the collation(s) used, because each predicate in a WHERE clause could potentially use a different collation, which in turn can differ from the collation used in the ORDER BY, and JOIN conditions (and GROUP BY, etc.) can have their own collations.
But to simplify a little:
If the data is Unicode and not using a _BIN collation, then it should match the same settings in .NET. Again, the _BIN2 collation should match the ordinal ComparisonStyle.
If the data is non-Unicode and uses either a SQL Server collation or a Windows _BIN collation, then cross your fingers, rub a lucky rabbit's foot (though not so lucky for the rabbit), etc.
But wait, there's more! Seriously.
You also need to consider:
I ended up doing the following:
Next, construct a Comparator corresponding to these rules.
Enjoy!
/**
 * Returns the Comparator associated with the database's default collation.
 * <p>
 * Beware! <a href="http://stackoverflow.com/a/361059/14731">Some databases</a> sort unicode strings differently than
 * non-unicode strings, even for the same collation setting.
 *
 * @param unicode true if the String being sorted is unicode, false otherwise
 * @return the Comparator associated with the database's default collation
 * @throws DatabaseException if an unexpected database error occurs
 */
public Comparator<String> getComparator(boolean unicode)
    throws DatabaseException
{
    // @see http://stackoverflow.com/a/5072926/14731, http://stackoverflow.com/a/27052010/14731 and
    // http://stackoverflow.com/q/32209137/14731
    try (Connection connection = server.getDatasource().getConnection())
    {
        try (PreparedStatement statement = connection.prepareStatement(
            "SELECT description from sys.fn_HelpCollations()\n" +
            "WHERE name = SERVERPROPERTY('collation')"))
        {
            try (ResultSet rs = statement.executeQuery())
            {
                if (!rs.next())
                    throw new ObjectNotFoundException(this);
                String description = rs.getString(1);
                List<String> tokens = Arrays.asList(description.split(",\\s*"));
                // Description format: language,property1,property2,...,propertyN,sorting,...
                ComparatorBuilder comparatorBuilder = new ComparatorBuilder();
                // Skip the language
                tokens = tokens.subList(1, tokens.size());
                // See https://technet.microsoft.com/en-US/library/ms143515(v=SQL.90).aspx for a list of possible tokens
                for (String token: tokens)
                {
                    if (token.toLowerCase().contains("sort"))
                    {
                        // Stop as soon as we hit information related to the sorting order
                        break;
                    }
                    switch (token)
                    {
                        case "case-insensitive":
                        {
                            comparatorBuilder.caseInsensitive(true);
                            break;
                        }
                        case "accent-insensitive":
                        {
                            comparatorBuilder.accentInsensitive(true);
                            break;
                        }
                        case "kanatype-insensitive":
                        {
                            comparatorBuilder.kanaInsensitive(true);
                            break;
                        }
                        case "width-insensitive":
                        case "width-insensitive for Unicode Data":
                        {
                            comparatorBuilder.widthInsensitive(true);
                            break;
                        }
                        case "case-sensitive":
                        case "accent-sensitive":
                        case "kanatype-sensitive":
                        case "width-sensitive":
                        {
                            // Do nothing, this is the default setting.
                            break;
                        }
                        default:
                            throw new AssertionError(String.format("Unexpected token: '%s'. Description: '%s'", token, description));
                    }
                }
                assert (!rs.next()): "Database returned more rows than expected";
                if (unicode)
                    comparatorBuilder.discardHyphens(true);
                return comparatorBuilder.build();
            }
        }
    }
    catch (SQLException e)
    {
        throw new DatabaseException(e);
    }
}
import com.ibm.icu.text.Transliterator;
import java.text.Normalizer;
import java.util.Comparator;

/**
 * Converts a database collation to a Java comparator.
 * <p>
 * @see https://msdn.microsoft.com/en-us/library/hh230914.aspx?f=255&MSPPError=-2147217396
 * @see http://zarez.net/?p=1893
 * @author Gili Tzabari
 */
class ComparatorBuilder
{
    // SQL Server: https://technet.microsoft.com/en-US/library/ms143515(v=SQL.90).aspx
    private boolean caseInsensitive = false;
    private boolean accentInsensitive = false;
    private boolean kanaInsensitive = false;
    private boolean widthInsensitive = false;
    /**
     * Indicates if hyphens should be discarded prior to sorting (default = false).
     */
    private boolean discardHyphens = false;

    /**
     * @return true if the comparator ignores the difference between uppercase and lowercase letters (default = false)
     */
    public boolean caseInsensitive()
    {
        return caseInsensitive;
    }

    /**
     * @param value true if the comparator ignores the difference between uppercase and lowercase letters
     * @return this
     */
    public ComparatorBuilder caseInsensitive(boolean value)
    {
        this.caseInsensitive = value;
        return this;
    }

    /**
     * @return true if the comparator ignores the difference between accented and unaccented characters (default = false)
     */
    public boolean accentInsensitive()
    {
        return accentInsensitive;
    }

    /**
     * @param value true if the comparator ignores the difference between accented and unaccented characters
     * @return this
     */
    public ComparatorBuilder accentInsensitive(boolean value)
    {
        this.accentInsensitive = value;
        return this;
    }

    /**
     * @return true if the comparator ignores the difference between the two types of Japanese kana characters: Hiragana
     * and Katakana (default = false)
     */
    public boolean kanaInsensitive()
    {
        return kanaInsensitive;
    }

    /**
     * @param value true if the comparator ignores the difference between the two types of Japanese kana characters:
     * Hiragana and Katakana
     * @return this
     */
    public ComparatorBuilder kanaInsensitive(boolean value)
    {
        this.kanaInsensitive = value;
        return this;
    }

    /**
     * @return true if the comparator ignores the difference between a single-byte character and the same character when
     * represented as a double-byte character (default = false)
     */
    public boolean widthInsensitive()
    {
        return widthInsensitive;
    }

    /**
     * @param value true if the comparator ignores the difference between a single-byte character and the same character
     * when represented as a double-byte character
     * @return this
     */
    public ComparatorBuilder widthInsensitive(boolean value)
    {
        this.widthInsensitive = value;
        return this;
    }

    /**
     * @return true if the comparator discards hyphens prior to sorting (default = false)
     */
    public boolean discardHyphens()
    {
        return discardHyphens;
    }

    /**
     * @param value true if comparator discards hyphens prior to sorting
     * @return this
     */
    public ComparatorBuilder discardHyphens(boolean value)
    {
        this.discardHyphens = value;
        return this;
    }

    /**
     * @return a Comparator instance
     */
    public Comparator<String> build()
    {
        return (String first, String second) ->
        {
            String firstNormalized = first;
            String secondNormalized = second;
            if (discardHyphens)
            {
                firstNormalized = firstNormalized.replace("-", "");
                secondNormalized = secondNormalized.replace("-", "");
            }
            if (accentInsensitive)
            {
                // @see http://stackoverflow.com/a/3322174/14731
                // Decompose, then strip only the combining diacritical marks. This operates on the values
                // normalized so far (so hyphen removal is not undone) and leaves other non-ASCII characters,
                // such as kana, intact for the steps below.
                firstNormalized = Normalizer.normalize(firstNormalized, Normalizer.Form.NFD).
                    replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
                secondNormalized = Normalizer.normalize(secondNormalized, Normalizer.Form.NFD).
                    replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
            }
            if (kanaInsensitive)
            {
                // @see http://stackoverflow.com/a/6577778/14731
                Transliterator transliterator = Transliterator.getInstance("Hiragana-Katakana");
                firstNormalized = transliterator.transliterate(firstNormalized);
                secondNormalized = transliterator.transliterate(secondNormalized);
            }
            if (widthInsensitive)
            {
                Transliterator transliterator = Transliterator.getInstance("Halfwidth-Fullwidth");
                firstNormalized = transliterator.transliterate(firstNormalized);
                secondNormalized = transliterator.transliterate(secondNormalized);
            }
            // Case-normalization is not as easy as it seems. See
            // http://mattryall.net/blog/2009/02/the-infamous-turkish-locale-bug and the implementation of
            // String.compareToIgnoreCase(). Better to delegate to a trusted implementation.
            if (caseInsensitive)
                return firstNormalized.compareToIgnoreCase(secondNormalized);
            else
                return firstNormalized.compareTo(secondNormalized);
        };
    }
}
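If an approximation is good enough, the JDK's java.text.Collator offers a shorter route than hand-rolled normalization. The sketch below maps the case and accent flags onto Collator strengths. Whether that is close enough for a given data set is an assumption: Collator follows the JDK's own locale rules and is not claimed to reproduce SQL Server's ordering. The class name and the choice of Locale.ENGLISH are illustrative only.

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;

/** Hypothetical alternative: approximates CI/AS style flags with java.text.Collator. */
final class CollatorComparators {
    static Comparator<String> forFlags(boolean caseInsensitive, boolean accentInsensitive) {
        // Collator strengths are cumulative, so "case-sensitive but accent-insensitive" cannot be expressed.
        if (accentInsensitive && !caseInsensitive)
            throw new UnsupportedOperationException("No Collator strength matches CS_AI");
        Collator collator = Collator.getInstance(Locale.ENGLISH); // assumption: English rules
        // Decompose accented characters so that base letters and accents are weighed separately.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        if (accentInsensitive)
            collator.setStrength(Collator.PRIMARY);   // ignores accents and case
        else if (caseInsensitive)
            collator.setStrength(Collator.SECONDARY); // distinguishes accents, ignores case
        else
            collator.setStrength(Collator.TERTIARY);  // distinguishes accents and case
        return (first, second) -> collator.compare(first, second);
    }
}
```

Kana and width sensitivity are not covered by this sketch, so it is at best a stand-in for the case and accent handling in build().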