Convert UTF-8 varbinary(max) to varchar(max)

I have a varbinary(max) column with UTF-8-encoded text that has been compressed. I would like to decompress this data and work with it in T-SQL as a varchar(max) using the UTF-8 capabilities of SQL Server.

I'm looking for a way of specifying the encoding when converting from varbinary(max) to varchar(max). The only way I've managed to do that is by creating a table variable with a column with a UTF-8 collation and inserting the varbinary data into it.

DECLARE @rv TABLE(
    Res varchar(max) COLLATE Latin1_General_100_CI_AS_SC_UTF8 
)

INSERT INTO @rv
SELECT SUBSTRING(Decompressed, 4, DATALENGTH(Decompressed) - 3) WithoutBOM
FROM
    (SELECT DECOMPRESS(RawResource) AS Decompressed FROM Resource) t

I'm wondering if there is a more elegant and efficient approach that does not involve inserting into a table variable.

UPDATE:

Boiling this down to a simple example that doesn't deal with byte order marks or compression:

I have the string "Hello 😊", UTF-8 encoded without a BOM, stored in the variable @utf8Binary:

DECLARE @utf8Binary varbinary(max) = 0x48656C6C6F20F09F988A

Now I try to assign that into various char-based variables and print the result:

DECLARE @brokenVarChar varchar(max) = CONVERT(varchar(max), @utf8Binary)
print '@brokenVarChar = ' + @brokenVarChar

DECLARE @brokenNVarChar nvarchar(max) = CONVERT(varchar(max), @utf8Binary)
print '@brokenNVarChar = ' +  @brokenNVarChar 

DECLARE @rv TABLE(
    Res varchar(max) COLLATE Latin1_General_100_CI_AS_SC_UTF8 
)

INSERT INTO @rv
select @utf8Binary

DECLARE @working nvarchar(max)
Select TOP 1 @working = Res from @rv

print '@working = ' + @working

The results of this are:

@brokenVarChar = Hello ðŸ˜Š
@brokenNVarChar = Hello ðŸ˜Š
@working = Hello 😊

So I am able to get the binary result properly decoded using this indirect method, but I am wondering if there is a more straightforward (and likely efficient) approach.

I don't like this solution, but it's the one I arrived at (I initially thought it wasn't working, due to what appears to be a bug in ADS). One method is to create a new database with a UTF-8 collation and pass the value to a function in that database. Because that database's default collation is UTF-8 rather than your local database's collation, the cast to varchar inside the function decodes the bytes as UTF-8 and the correct result is returned:

CREATE DATABASE UTF8 COLLATE Latin1_General_100_CI_AS_SC_UTF8;
GO
USE UTF8;
GO
CREATE OR ALTER FUNCTION dbo.Bin2UTF8 (@utfbinary varbinary(MAX))
RETURNS varchar(MAX) AS
BEGIN
    RETURN CAST(@utfbinary AS varchar(MAX));
END
GO
USE YourDatabase;
GO
SELECT UTF8.dbo.Bin2UTF8(0x48656C6C6F20F09F988A);

This, however, isn't particularly "pretty".
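
To tie this back to the question, here is a minimal sketch of passing a value with a byte order mark through that function. It assumes the UTF8 database and dbo.Bin2UTF8 from above already exist. The @withBom and @payload variables and the sample bytes (the same "Hello 😊" payload with the UTF-8 BOM EF BB BF prepended) are made up for illustration, and the conditional BOM check is an addition of mine rather than part of the original query.

```sql
-- Hypothetical sample: UTF-8 BOM (EF BB BF) followed by the bytes for 'Hello 😊'
DECLARE @withBom varbinary(max) = 0xEFBBBF48656C6C6F20F09F988A;

-- Strip the 3-byte BOM only when it is actually present
DECLARE @payload varbinary(max) =
    CASE WHEN SUBSTRING(@withBom, 1, 3) = 0xEFBBBF
         THEN SUBSTRING(@withBom, 4, DATALENGTH(@withBom) - 3)
         ELSE @withBom
    END;

-- Assumes the UTF8 database and dbo.Bin2UTF8 defined above exist
SELECT UTF8.dbo.Bin2UTF8(@payload) AS Res;
```

For the table in the question, the same expression should work with DECOMPRESS(RawResource) in place of @withBom.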

There is an undocumented hack:

DECLARE @utf8 VARBINARY(MAX)=0x48656C6C6F20F09F988A;

SELECT CAST(CONCAT('<?xml version="1.0" encoding="UTF-8" ?><![CDATA[',@utf8,']]>') AS XML)
       .value('.','nvarchar(max)');

The result

Hello 😊

This works even in versions without the new UTF8 collations...

UPDATE: calling this as a function

This can easily be wrapped in a scalar function

CREATE FUNCTION dbo.Convert_UTF8_Binary_To_NVarchar(@utfBinary VARBINARY(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
    RETURN
    (
    SELECT CAST(CONCAT('<?xml version="1.0" encoding="UTF-8" ?><![CDATA[',@utfBinary,']]>') AS XML)
           .value('.','nvarchar(max)')
    );
END
GO

Or like this, as an inline table-valued function:

CREATE FUNCTION dbo.Convert_UTF8_Binary_To_NVarchar(@utfBinary VARBINARY(MAX))
RETURNS TABLE
AS
    RETURN
    SELECT CAST(CONCAT('<?xml version="1.0" encoding="UTF-8" ?><![CDATA[',@utfBinary,']]>') AS XML)
           .value('.','nvarchar(max)') AS ConvertedString
GO

This can be used after FROM or, more appropriately, with APPLY.
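
A minimal sketch of the APPLY usage could look like this. It assumes the inline table-valued version of dbo.Convert_UTF8_Binary_To_NVarchar has been created (both variants above share a name, so only one of them can exist at a time). The table variable @samples and its rows are hypothetical test data, not something from the question.

```sql
-- Hypothetical sample data: two UTF-8 encoded values without a BOM
DECLARE @samples TABLE (ID int IDENTITY(1, 1), Utf8Data varbinary(max));

INSERT INTO @samples (Utf8Data)
VALUES (0x48656C6C6F20F09F988A),  -- UTF-8 bytes for 'Hello 😊'
       (0x4772C3BCC39F65);        -- UTF-8 bytes for 'Grüße'

-- CROSS APPLY calls the inline function once per row and exposes its column
SELECT s.ID,
       c.ConvertedString
FROM @samples AS s
CROSS APPLY dbo.Convert_UTF8_Binary_To_NVarchar(s.Utf8Data) AS c;
```

An inline function like this is expanded into the calling query, which is usually why it is preferred over the scalar version for set-based work.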

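Another suggestion is to concatenate the binary value with an empty string that carries a UTF-8 collation, so that the concatenated result takes on that collation:

DECLARE @utf8Binary varbinary(max) = 0x48656C6C6F20F09F988A;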
DECLARE @brokenNVarChar nvarchar(max) = concat(@utf8Binary, '' COLLATE Latin1_General_100_CI_AS_SC_UTF8);
print '@brokenNVarChar = ' +  @brokenNVarChar;

You didn't say how your data is compressed or which compression algorithm was used. If it was compressed with the COMPRESS function (SQL Server 2016 or later), you can use the DECOMPRESS function and then cast the result to VARCHAR(MAX). Both COMPRESS and DECOMPRESS use the GZip algorithm, and DECOMPRESS returns a byte array (VARBINARY(MAX) type).

CAST(DECOMPRESS([compressed content here]) AS VARCHAR(MAX))

See: COMPRESS (Transact-SQL) and DECOMPRESS (Transact-SQL)
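
As a rough illustration of what DECOMPRESS does and does not do, the sketch below round-trips the sample bytes from the question through COMPRESS and DECOMPRESS. The variable names @compressed and @decompressed are mine. Note that DECOMPRESS only restores the original bytes; the final CAST still interprets them using the current database's default collation, so in a database without a UTF-8 default collation I would expect it to show the same broken text as in the question.

```sql
-- UTF-8 bytes for 'Hello 😊', as in the question
DECLARE @utf8Binary varbinary(max) = 0x48656C6C6F20F09F988A;

-- COMPRESS accepts binary input and returns GZip-compressed varbinary(max)
DECLARE @compressed varbinary(max) = COMPRESS(@utf8Binary);

-- DECOMPRESS gives back the original bytes, still as varbinary(max)
DECLARE @decompressed varbinary(max) = DECOMPRESS(@compressed);

SELECT CASE WHEN @decompressed = @utf8Binary
            THEN 'same bytes' ELSE 'different bytes'
       END AS RoundTrip,
       -- Decoded with the database default collation, not necessarily UTF-8
       CAST(@decompressed AS varchar(max)) AS DecodedWithDbCollation;
```

So for UTF-8 data in a non-UTF-8 database, the decompressed bytes still need one of the conversion approaches described earlier.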
