Convert UTF-8 varbinary(max) to varchar(max)

I have a varbinary(max) column containing UTF-8-encoded text that has been compressed. I would like to decompress this data and work with it in T-SQL as a varchar(max), using SQL Server's UTF-8 capabilities.

I'm looking for a way to specify the encoding when converting from varbinary(max) to varchar(max). The only way I've managed to do that is by creating a table variable with a column that has a UTF-8 collation and inserting the varbinary data into it.

DECLARE @rv TABLE(
    Res varchar(max) COLLATE Latin1_General_100_CI_AS_SC_UTF8 
)

INSERT INTO @rv
SELECT SUBSTRING(Decompressed, 4, DATALENGTH(Decompressed) - 3) WithoutBOM
FROM
    (SELECT DECOMPRESS(RawResource) AS Decompressed FROM Resource) t

I'm wondering if there is a more elegant and efficient approach that does not involve inserting into a table variable.
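The query above removes the first three bytes unconditionally, which assumes every decompressed value starts with the UTF-8 byte order mark (0xEFBBBF). A minimal sketch of a more defensive variant, which strips the BOM only when it is present, might look like this. The variable names and sample values are hypothetical and used only for illustration.

```sql
-- Hypothetical sample values: one with a UTF-8 BOM, one without.
DECLARE @withBom    varbinary(max) = 0xEFBBBF48656C6C6F;  -- BOM + "Hello"
DECLARE @withoutBom varbinary(max) = 0x48656C6C6F;        -- "Hello"

SELECT
    CASE
        -- Remove the first three bytes only if they are the UTF-8 BOM.
        WHEN SUBSTRING(v.Bytes, 1, 3) = 0xEFBBBF
            THEN SUBSTRING(v.Bytes, 4, DATALENGTH(v.Bytes) - 3)
        ELSE v.Bytes
    END AS WithoutBOM
FROM (VALUES (@withBom), (@withoutBom)) AS v(Bytes);
```

Both rows should come back as the same five bytes, so the later conversion step sees identical input whether or not the source included a BOM.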

UPDATE:

Boiling this down to a simple example that doesn't deal with byte order marks or compression:

I have the string "Hello 😊", UTF-8 encoded without a BOM, stored in the variable @utf8Binary:

DECLARE @utf8Binary varbinary(max) = 0x48656C6C6F20F09F988A

Now I try to assign that to various character-based variables and print the result:

DECLARE @brokenVarChar varchar(max) = CONVERT(varchar(max), @utf8Binary)
print '@brokenVarChar = ' + @brokenVarChar

DECLARE @brokenNVarChar nvarchar(max) = CONVERT(varchar(max), @utf8Binary)
print '@brokenNVarChar = ' +  @brokenNVarChar 

DECLARE @rv TABLE(
    Res varchar(max) COLLATE Latin1_General_100_CI_AS_SC_UTF8 
)

INSERT INTO @rv
select @utf8Binary

DECLARE @working nvarchar(max)
Select TOP 1 @working = Res from @rv

print '@working = ' + @working

The results of this are:

@brokenVarChar = Hello ðŸ˜Š
@brokenNVarChar = Hello ðŸ˜Š
@working = Hello 😊

So I am able to get the binary data properly decoded using this indirect method, but I am wondering if there is a more straightforward (and likely more efficient) approach.

I don't like this solution, but it's the one I arrived at (I initially thought it wasn't working, due to what appears to be a bug in ADS). One method would be to create a new database with a UTF8 collation, and then pass the value to a function in that database. Because the database uses a UTF8 collation, its default collation differs from that of the local database, and the correct result is returned:

CREATE DATABASE UTF8 COLLATE Latin1_General_100_CI_AS_SC_UTF8;
GO
USE UTF8;
GO
CREATE OR ALTER FUNCTION dbo.Bin2UTF8 (@utfbinary varbinary(MAX))
RETURNS varchar(MAX) AS
BEGIN
    RETURN CAST(@utfbinary AS varchar(MAX));
END
GO
USE YourDatabase;
GO
SELECT UTF8.dbo.Bin2UTF8(0x48656C6C6F20F09F988A);

This, however, isn't particularly "pretty".

There is an undocumented hack:

DECLARE @utf8 VARBINARY(MAX)=0x48656C6C6F20F09F988A;

SELECT CAST(CONCAT('<?xml version="1.0" encoding="UTF-8" ?><![CDATA[',@utf8,']]>') AS XML)
       .value('.','nvarchar(max)');

The result:

Hello 😊

This works even in versions without the new UTF8 collations...

UPDATE: calling this as a function

This can easily be wrapped in a scalar function:

CREATE FUNCTION dbo.Convert_UTF8_Binary_To_NVarchar(@utfBinary VARBINARY(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
    RETURN
    (
    SELECT CAST(CONCAT('<?xml version="1.0" encoding="UTF-8" ?><![CDATA[',@utfBinary,']]>') AS XML)
           .value('.','nvarchar(max)')
    );
END
GO

Or like this, as an inline table-valued function:

CREATE FUNCTION dbo.Convert_UTF8_Binary_To_NVarchar(@utfBinary VARBINARY(MAX))
RETURNS TABLE
AS
    RETURN
    SELECT CAST(CONCAT('<?xml version="1.0" encoding="UTF-8" ?><![CDATA[',@utfBinary,']]>') AS XML)
           .value('.','nvarchar(max)') AS ConvertedString
GO

This can be used after FROM or, more appropriately, with APPLY.
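As a rough sketch of the APPLY usage, the following assumes the inline table-valued version of dbo.Convert_UTF8_Binary_To_NVarchar shown above has already been created. The table variable @samples and its columns are hypothetical and exist only to provide rows to convert.

```sql
-- Hypothetical sample data: UTF-8 bytes without a BOM.
DECLARE @samples TABLE (Id int, Utf8Bytes varbinary(max));

INSERT INTO @samples (Id, Utf8Bytes)
VALUES (1, 0x48656C6C6F20F09F988A),  -- "Hello " followed by U+1F60A
       (2, 0x48656C6C6F);            -- "Hello"

-- CROSS APPLY evaluates the inline function for each row of @samples.
-- Assumes the inline table-valued function defined above exists.
SELECT s.Id, c.ConvertedString
FROM @samples AS s
CROSS APPLY dbo.Convert_UTF8_Binary_To_NVarchar(s.Utf8Bytes) AS c;
```

Since the function wraps the bytes in a CDATA section, input that happens to contain the sequence `]]>` would end that section early, so this sketch is best suited to data where that sequence cannot occur.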

DECLARE @utf8Binary varbinary(max) = 0x48656C6C6F20F09F988A;
DECLARE @brokenNVarChar nvarchar(max) = concat(@utf8Binary, '' COLLATE Latin1_General_100_CI_AS_SC_UTF8);
print '@brokenNVarChar = ' +  @brokenNVarChar;

You didn't say how your data was compressed or which compression algorithm was used. But if you used the COMPRESS function in SQL Server 2016 or later, you can use the DECOMPRESS function and then cast the result to VARCHAR(MAX). Both COMPRESS and DECOMPRESS use the GZip compression algorithm. DECOMPRESS returns a byte array (VARBINARY(MAX) type).

CAST(DECOMPRESS([compressed content here]) AS VARCHAR(MAX))

See: COMPRESS (Transact-SQL) and DECOMPRESS (Transact-SQL)
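A minimal round-trip sketch of that pattern, using hypothetical variable names and a plain ASCII sample string:

```sql
-- Hypothetical sample: compress a varchar value, then restore it.
DECLARE @original   varchar(max)   = 'Hello, world';
DECLARE @compressed varbinary(max) = COMPRESS(@original);

-- DECOMPRESS returns varbinary(max); cast back to the type that was compressed.
SELECT CAST(DECOMPRESS(@compressed) AS varchar(max)) AS RoundTripped;
```

The cast target should match the type originally passed to COMPRESS (varchar or nvarchar). For UTF-8 bytes containing non-ASCII characters, a plain cast to varchar interprets them with the database's default code page, as the question's @brokenVarChar example shows, so one of the UTF-8-aware conversions above is still needed in that case.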
