
How to map SQL collation setting to a Java comparator?

Is there a way to translate a database's collation setting (e.g. SQL_Latin1_General_CP1_CI_AS) to a Java Comparator implementation, so that I can apply the same ordering in Java code as the database does?

Is there an existing library that already provides this mapping?

Simplistically, you can use the COLLATIONPROPERTY function, which gives you:

  • CodePage
  • LCID
  • ComparisonStyle
  • Version

The ComparisonStyle is a bit-masked field that is encoded as follows:

  • Case insensitivity (IgnoreCase) = 1
  • Accent insensitivity (IgnoreNonSpace) = 2
  • Kana type insensitivity (IgnoreKanaType) = 65536
  • Width insensitivity (IgnoreWidth) = 131072

Unfortunately, an everything-sensitive collation (e.g. Latin1_General_CS_AS_KS_WS) equates to 0, and so do both the _BIN and _BIN2 collations. Hence the bit mask alone is ambiguous: you still need to check the collation name for a _BIN or _BIN2 suffix to get the full picture.
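On the Java side, decoding that bit mask is straightforward. The following is only a minimal sketch: the class, enum, and method names are made up for illustration, and it assumes you have already fetched the ComparisonStyle value and the collation name from the server.

```java
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; none of these names come from SQL Server or any library.
final class ComparisonStyleDecoder {
    enum Flag {
        IGNORE_CASE(1),
        IGNORE_NON_SPACE(2),
        IGNORE_KANA_TYPE(65536),
        IGNORE_WIDTH(131072);

        final int mask;

        Flag(int mask) {
            this.mask = mask;
        }
    }

    // Returns the set of insensitivity flags encoded in a ComparisonStyle value.
    static Set<Flag> decode(int comparisonStyle) {
        Set<Flag> flags = EnumSet.noneOf(Flag.class);
        for (Flag flag : Flag.values()) {
            if ((comparisonStyle & flag.mask) != 0) {
                flags.add(flag);
            }
        }
        return flags;
    }

    // A ComparisonStyle of 0 is ambiguous, so the name has to be inspected too.
    // contains() is used rather than endsWith() so that names with a trailing
    // suffix after _BIN2 (for example ..._BIN2_UTF8) are also recognised.
    static boolean isBinary(String collationName) {
        return collationName.toUpperCase(Locale.ROOT).contains("_BIN");
    }
}
```

For example, a value of 196609 (1 + 65536 + 131072) decodes to case, kana-type, and width insensitivity, with accent sensitivity retained.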


But it is not that simple. There are two main types of collations: SQL Server collations and Windows collations.

The SQL Server collations (i.e. those starting with SQL_) are legacy collations kept for backward compatibility and should not be used anymore, though a lot of systems do default to SQL_Latin1_General_CP1_CI_AS.

For both types of collations, NCHAR / NVARCHAR / XML data uses the Unicode sorting algorithms. For non-Unicode data, the Windows collations should sort the same in SQL Server and .NET. For the SQL Server collations, however, the sorting algorithm does not necessarily match the Windows collation (or possibly anything else). They do have their own Sort Order IDs, though, and there might be public documentation describing those rules.

The Windows collations have several variations:

  • differing versions: a name with no version number should belong to the original set, the first set of updates is labeled _90, and the newest updates are the _100 series.

  • differing binary ordering: the older _BIN collations do not map exactly to anything in .NET, since they compare the first character as a character and the remainder byte by byte. The newer _BIN2 collations are pure code-point comparisons and should map to an ordinal comparison (e.g. CompareOptions.Ordinal).
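In Java, the closest thing to an ordinal comparison is String.compareTo, which compares UTF-16 code units. That differs from a strict code-point ordering only when supplementary characters (surrogate pairs) are involved. Which of the two a given _BIN2 column actually follows is worth verifying against your own server. The sketch below, with illustrative names, shows how the two orderings can disagree.

```java
import java.util.Comparator;

// Hypothetical helper contrasting two "binary" orderings available in Java.
final class OrdinalComparisons {
    // Compares UTF-16 code units; this is exactly what String.compareTo does.
    static final Comparator<String> BY_CODE_UNIT = String::compareTo;

    // Compares whole Unicode code points, so supplementary characters sort
    // after every character in the Basic Multilingual Plane.
    static int compareByCodePoint(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }
}
```

With U+FF5E (a BMP character) and U+1F600 (encoded as the surrogate pair D83D DE00), BY_CODE_UNIT puts U+1F600 first because 0xD83D is less than 0xFF5E, while compareByCodePoint puts U+FF5E first.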

Beyond the specifics of any particular collation, there is another factor complicating what you are trying to accomplish: the default collation for a database does not necessarily determine the collation used for sorting / comparing a particular predicate or field! The collation can be taken from the field being operated on, it can be taken from the database default for string literals and variables, or it can be overridden in both cases via the COLLATE clause. Please see the MSDN page for Collation Precedence for more details.

In the end, there is no deterministic means of getting the collation(s) used, because each predicate in a WHERE clause could potentially use a different collation, that can differ from the collation used in the ORDER BY, and JOIN conditions (and GROUP BY, etc.) can each have their own collations.

But to simplify a little:

  • If the data is non-Unicode, check the Code Page for the specified locale / LCID. Then use that to create the same Encoding in .NET.
  • If the data is Unicode and not using a _BIN collation, then it should match the same settings in .NET. Again, the _BIN2 collation should match an ordinal comparison.
  • If the data is non-Unicode with a SQL Server collation or a Windows _BIN collation, then cross your fingers, rub a lucky rabbit's foot (though not so lucky for the rabbit), etc.
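The points above are phrased in terms of .NET. For Java, a rough analogue of the case / accent sensitivity settings is the strength of a java.text.Collator. This is only a sketch under the assumption that Java's collation rules for the locale are close enough to the Windows ones, which is not guaranteed; the class and method names are illustrative.

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;

// Hypothetical helper: approximates _CI/_CS and _AI/_AS with Collator strength.
final class CollatorSketch {
    static Comparator<String> forSensitivity(Locale locale,
                                             boolean caseInsensitive,
                                             boolean accentInsensitive) {
        Collator collator = Collator.getInstance(locale);
        // Treat precomposed and decomposed accented characters alike.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        if (accentInsensitive) {
            if (!caseInsensitive) {
                // PRIMARY strength ignores case as well as accents, so
                // case-sensitive + accent-insensitive cannot be expressed
                // through strength alone.
                throw new UnsupportedOperationException("CS_AI is not covered by this sketch");
            }
            collator.setStrength(Collator.PRIMARY);    // roughly CI_AI
        } else if (caseInsensitive) {
            collator.setStrength(Collator.SECONDARY);  // roughly CI_AS
        } else {
            collator.setStrength(Collator.TERTIARY);   // roughly CS_AS
        }
        return collator::compare;
    }
}
```

Kana-type and width sensitivity have no direct counterpart here, so a comparator built this way should be checked against real query results before it is relied upon.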

But wait, there's more! Seriously.

You need to also consider:

  • As with any standard, it is up to the implementer to follow the spec. That doesn't always happen. So even with what should be a truly equivalent collation between SQL Server and your Java app, and even if there are no issues with Collation Precedence, there can still be differences in sorting and comparisons. For an example, check out my "update" on this answer on DBA.StackExchange: Why does MS SQL Server return a result for empty string check when Unicode string is not empty
  • If you are transferring data between .NET and Java as raw bytes, keep in mind that Java's default "UTF-16" charset encodes as Big Endian while .NET's Encoding.Unicode is UTF-16 Little Endian, so state the byte order explicitly on both sides.
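To make the byte-order point concrete, here is a small Java sketch (the class and method names are illustrative) that encodes and decodes with an explicit byte order instead of relying on the default behaviour of the "UTF-16" charset.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for exchanging UTF-16 bytes with a little-endian peer.
final class Utf16ByteOrder {
    // Little-endian, no byte-order mark.
    static byte[] toLittleEndian(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE);
    }

    // Big-endian, no byte-order mark.
    static byte[] toBigEndian(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE);
    }

    // Decodes bytes that are known to be little-endian UTF-16.
    static String fromLittleEndian(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16LE);
    }
}
```

Encoding "A" yields the bytes 41 00 in little-endian order and 00 41 in big-endian order, whereas StandardCharsets.UTF_16 produces FE FF 00 41 (a big-endian byte-order mark followed by the character).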

I ended up doing the following:

  1. Query the current database's collation setting.
  2. Next, parse the description of the collation into sub-components such as "case-insensitive" or "accent-sensitive".
  3. Next, construct a Comparator corresponding to these rules.

Enjoy!

/**
 * Returns the Comparator associated with the database's default collation.
 * <p>
 * Beware! <a href="http://stackoverflow.com/a/361059/14731">Some databases</a> sort unicode strings differently than
 * non-unicode strings, even for the same collation setting.
 * <p>
 * @param unicode true if the String being sorted is unicode, false otherwise
 * @return the Comparator associated with the database's default collation
 * @throws DatabaseException if an unexpected database error occurs
 */
public Comparator<String> getComparator(boolean unicode)
    throws DatabaseException
{
    // @see http://stackoverflow.com/a/5072926/14731, http://stackoverflow.com/a/27052010/14731 and
    // http://stackoverflow.com/q/32209137/14731
    try (Connection connection = server.getDatasource().getConnection())
    {
        try (PreparedStatement statement = connection.prepareStatement(
            "SELECT description from sys.fn_HelpCollations()\n" +
            "WHERE name = SERVERPROPERTY('collation')"))
        {
            try (ResultSet rs = statement.executeQuery())
            {
                if (!rs.next())
                    throw new ObjectNotFoundException(this);
                String description = rs.getString(1);
                List<String> tokens = Arrays.asList(description.split(",\\s*"));
                // Description format: language,property1,property2,...,propertyN,sorting,...
                ComparatorBuilder comparatorBuilder = new ComparatorBuilder();

                // Skip the language
                tokens = tokens.subList(1, tokens.size());
                // See https://technet.microsoft.com/en-US/library/ms143515(v=SQL.90).aspx for a list of possible tokens
                for (String token: tokens)
                {
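                    // Locale.ROOT (java.util.Locale) keeps the lowercasing independent of the JVM's default locale
                    if (token.toLowerCase(Locale.ROOT).contains("sort"))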
                    {
                        // Stop as soon as we hit information related to the sorting order
                        break;
                    }
                    switch (token)
                    {
                        case "case-insensitive":
                        {
                            comparatorBuilder.caseInsensitive(true);
                            break;
                        }
                        case "accent-insensitive":
                        {
                            comparatorBuilder.accentInsensitive(true);
                            break;
                        }
                        case "kanatype-insensitive":
                        {
                            comparatorBuilder.kanaInsensitive(true);
                            break;
                        }
                        case "width-insensitive":
                        case "width-insensitive for Unicode Data":
                        {
                            comparatorBuilder.widthInsensitive(true);
                            break;
                        }
                        case "case-sensitive":
                        case "accent-sensitive":
                        case "kanatype-sensitive":
                        case "width-sensitive":
                        {
                            // Do nothing, this is the default setting.
                            break;
                        }
                        default:
                            throw new AssertionError(String.format("Unexpected token: '%s'. Description: '%s'", token, description));
                    }
                }
                assert (!rs.next()): "Database returned more rows than expected";
                if (unicode)
                    comparatorBuilder.discardHyphens(true);
                return comparatorBuilder.build();
            }
        }
    }
    catch (SQLException e)
    {
        throw new DatabaseException(e);
    }
}

import com.ibm.icu.text.Transliterator;
import java.text.Normalizer;
import java.util.Comparator;

/**
 * Converts a database collation to a Java comparator.
 * <p>
 * @see https://msdn.microsoft.com/en-us/library/hh230914.aspx?f=255&MSPPError=-2147217396
 * @see http://zarez.net/?p=1893
 * @author Gili Tzabari
 */
class ComparatorBuilder
{
    // SQL Server: https://technet.microsoft.com/en-US/library/ms143515(v=SQL.90).aspx
    private boolean caseInsensitive = false;
    private boolean accentInsensitive = false;
    private boolean kanaInsensitive = false;
    private boolean widthInsensitive = false;
    /**
     * Indicates if hyphens should be discarded prior to sorting (default = false).
     */
    private boolean discardHyphens = false;

    /**
     * @return true if the comparator ignores the difference between uppercase and lowercase letters (default = false)
     */
    public boolean caseInsensitive()
    {
        return caseInsensitive;
    }

    /**
     * @param value true if the comparator ignores the difference between uppercase and lowercase letters
     * @return this
     */
    public ComparatorBuilder caseInsensitive(boolean value)
    {
        this.caseInsensitive = value;
        return this;
    }

    /**
     * @return true if the comparator ignores the difference between accented and unaccented characters (default = false)
     */
    public boolean accentInsensitive()
    {
        return accentInsensitive;
    }

    /**
     * @param value true if the comparator ignores the difference between accented and unaccented characters
     * @return this
     */
    public ComparatorBuilder accentInsensitive(boolean value)
    {
        this.accentInsensitive = value;
        return this;
    }

    /**
     * @return true if the comparator ignores the difference between the two types of Japanese kana characters: Hiragana
     *         and Katakana (default = false)
     */
    public boolean kanaInsensitive()
    {
        return kanaInsensitive;
    }

    /**
     * @param value true if the comparator ignores the difference between the two types of Japanese kana characters:
     *              Hiragana and Katakana
     * @return this
     */
    public ComparatorBuilder kanaInsensitive(boolean value)
    {
        this.kanaInsensitive = value;
        return this;
    }

    /**
     * @return true if the comparator ignores the difference between a single-byte character and the same character when
     *         represented as a double-byte character (default = false)
     */
    public boolean widthInsensitive()
    {
        return widthInsensitive;
    }

    /**
     * @param value true if the comparator ignores the difference between a single-byte character and the same character
     *              when represented as a double-byte character
     * @return this
     */
    public ComparatorBuilder widthInsensitive(boolean value)
    {
        this.widthInsensitive = value;
        return this;
    }

    /**
     * @return true if the comparator discards hyphens prior to sorting (default = false)
     */
    public boolean discardHyphens()
    {
        return discardHyphens;
    }

    /**
     * @param value true if comparator discards hyphens prior to sorting
     * @return this
     */
    public ComparatorBuilder discardHyphens(boolean value)
    {
        this.discardHyphens = value;
        return this;
    }

    /**
     * @return a Comparator instance
     */
    public Comparator<String> build()
    {
        return (java.lang.String first, java.lang.String second) ->
        {
            String firstNormalized = first;
            String secondNormalized = second;
            if (discardHyphens)
            {
                firstNormalized = firstNormalized.replaceAll("-", "");
                secondNormalized = secondNormalized.replaceAll("-", "");
            }
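            if (accentInsensitive)
            {
                // @see http://stackoverflow.com/a/3322174/14731
                // Decompose, then strip combining marks only. Start from the already-normalized strings so the
                // hyphen handling above is preserved, and keep non-ASCII base characters (e.g. kana) intact.
                firstNormalized = Normalizer.normalize(firstNormalized, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
                secondNormalized = Normalizer.normalize(secondNormalized, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
            }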
            if (kanaInsensitive)
            {
                // @see http://stackoverflow.com/a/6577778/14731
                Transliterator transliterator = Transliterator.getInstance("Hiragana-Katakana");
                firstNormalized = transliterator.transliterate(firstNormalized);
                secondNormalized = transliterator.transliterate(secondNormalized);
            }
            if (widthInsensitive)
            {
                Transliterator transliterator = Transliterator.getInstance("Halfwidth-Fullwidth");
                firstNormalized = transliterator.transliterate(firstNormalized);
                secondNormalized = transliterator.transliterate(secondNormalized);
            }
            // Case-normalization is not as easy as it seems. See
            // http://mattryall.net/blog/2009/02/the-infamous-turkish-locale-bug and the implementation of
            // String.compareToIgnoreCase(). Better to delegate to a trusted implementation.
            if (caseInsensitive)
                return firstNormalized.compareToIgnoreCase(secondNormalized);
            else
                return firstNormalized.compareTo(secondNormalized);
        };
    }
}
