SQL Server select unicode null characters in nvarchar strings

I have content that has been imported into our SQL Server 2008 database (using collation SQL_Latin1_General_CP1_CI_AS) that is contaminated with UNICODE NULLS in nvarchar(128) columns.

The impact is that it blows up our Java libraries when they try to export the content to PDF reports and perform other such manipulations.

I am trying to locate and modify the values in the various tables and columns. I am told by some of our staff that the offending values look like 'usernam e' instead of 'username'.

In trying to find these offending UNICODE NULLS, I've run the following SQL:

SELECT name 
FROM users
WHERE name LIKE '%[^ -~]%' COLLATE Latin1_General_BIN

Returned is the following set:

M
M
M
N
S
S
S
S
ÿþA

I think that these one-letter values might be followed by UNICODE NULLS, but I don't know for sure. The final one certainly looks suspicious as well.

Is there some way of using CONVERT and the hex value 0x00 to locate UNICODE NULLS in nvarchar strings?

EDIT #1:

select name, CAST(RIGHT(name,1) AS varbinary(128)) AS RIGHTER_1
from users
where id=1

returns:

B   0x4200

So, that letter 'B' is a bit funny. There really are UNICODE NULLS here, and the libraries are not architected to handle UNICODE. They're rock solid with LATIN UTF8 chars.

You could use CAST(name AS varbinary(128)) to see the value as hex and examine it.
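
For example, a minimal sketch assuming the users table and name column from the question:

SELECT name, CAST(name AS varbinary(128)) AS name_hex
FROM users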

You could find 'null characters' using a condition such as name LIKE '%'+CHAR(0)+'%'; however, a valid Unicode string could contain zeroes as well, so this is probably not what you need to do.
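
A hedged sketch of that approach; NCHAR(0) and the binary collation are assumptions here, used so the embedded null character is compared byte-for-byte rather than possibly ignored by the default collation:

SELECT name
FROM users
WHERE name LIKE '%' + NCHAR(0) + '%' COLLATE Latin1_General_BIN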

Are you sure that the problem is not in your libraries, or in the PDF generator? It looks like you have Unicode strings in the database, but the application is interpreting them as ASCII strings.

Trying to look for null unicode character sequences using varbinary conversions can result in false positives, for example the following Unicode in UTF-16 LE:

20 00 00 A0

The string is a space followed by the character U+A000 (whose little-endian bytes are 00 A0). Both are valid non-null characters. However, if you did this:

where charindex (0x0000, cast(UnicodeText as varbinary (max))) > 0

You would get a false positive between the end of the space and the beginning of the next character.
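
A small demonstration of that boundary effect, using NCHAR(0xA000) (an assumption chosen because its little-endian bytes are 00 A0, matching the example above):

DECLARE @s nvarchar(10) = N' ' + NCHAR(0xA000);        -- stored bytes: 20 00 00 A0
SELECT CHARINDEX(0x0000, CAST(@s AS varbinary(max)));  -- a position > 0: a false positive, even though neither character is a null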

Here's a function that I wrote. Note that it does not perform very well with large text, something I'm working on improving. Possibly a CLR proc would work better. Try this:

-- Returns the position of the first U+0000 (embedded null) character in
-- @Input at or after @StartPosition, or 0 if none is found.
create function dbo.FindNullUnicode
(
    @Input nvarchar(max)
    ,@StartPosition bigint = 1
)
returns bigint
as
begin
    if @StartPosition < 1
        set @StartPosition = 1;

    declare @pos bigint = @StartPosition;
    declare @len bigint = len(@Input);

    -- Walk the string one character at a time and test each code point.
    while (@pos <= @len)
    begin
        if unicode(SUBSTRING(@Input, @pos, 1)) = 0
            return @pos;

        set @pos += 1;
    end;
    return 0;
end
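
A hypothetical usage sketch against the users table from the question, returning only rows that contain an embedded null and the position of the first one:

SELECT name, dbo.FindNullUnicode(name, 1) AS first_null_position
FROM users
WHERE dbo.FindNullUnicode(name, 1) > 0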

Given that the original post is more than 9 months old, this is, I am sure, too late for the poster. But, per the documentation, the nchar and nvarchar data types are Unicode. They are defined as:

String data types that are either fixed-length, nchar, or variable-length, nvarchar, Unicode data and use the UNICODE UCS-2 character set.

UCS-2 means each character in the column occupies 2 bytes. If the data is single-byte characters, the high-order byte will naturally be 0x00, so every other octet is 0x00.
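
For example, the single letter N'A' is stored as two bytes with the high-order byte zero (shown in the little-endian byte order SQL Server uses, matching the 0x4200 seen for 'B' above):

SELECT CAST(N'A' AS varbinary(2))   -- 0x4100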

The original problem was that the consumer was almost certainly expecting ASCII or UTF-8 data rather than UCS-2/UTF-16. Most likely the columns should have been declared as char/varchar rather than nchar/nvarchar. The proper solution would be to do one of the following:

  • Alter the table so the columns are the correct data type
  • Alter the query to transform the columns using the convert() function, thus: convert(varchar(4000),my_nvarchar_column) (see the sketch after this list)
  • Alter the consumer to properly consume the double-byte characters
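
A minimal sketch of the first two options, assuming the users.name column from the question:

-- option 1: change the column itself to a single-byte type
ALTER TABLE users ALTER COLUMN name varchar(128);

-- option 2: leave the column alone and convert at query time
SELECT CONVERT(varchar(128), name) AS name
FROM users;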
