
Insert UTF-8 data into SQL Server 2008

I have an issue with encoding. I want to put data from a UTF-8-encoded file into a SQL Server 2008 database. SQL Server only features UCS-2 encoding, so I decided to explicitly convert the retrieved data.

// connect to page file
_fsPage = new FileStream(mySettings.filePage, FileMode.Open, FileAccess.Read);
_streamPage = new StreamReader(_fsPage, System.Text.Encoding.UTF8);

Here's the conversion routine for the data:

private string ConvertTitle(string title)
{
  string utf8_String = Regex.Replace(Regex.Replace(title, @"\\.", _myEvaluator), @"(?<=[^\\])_", " ");
  byte[] utf8_bytes = System.Text.Encoding.UTF8.GetBytes(utf8_String);
  byte[] ucs2_bytes = System.Text.Encoding.Convert(System.Text.Encoding.UTF8, System.Text.Encoding.Unicode, utf8_bytes);
  string ucs2_String = System.Text.Encoding.Unicode.GetString(ucs2_bytes);

  return ucs2_String;
}

When stepping through the code for critical titles, the variable watch shows the correct characters for both the UTF-8 and the UCS-2 string. But in the database the result is partially wrong: some special characters are saved correctly, others are not.

  • Wrong: ń becomes an n
  • Right: É and é, for example, are inserted correctly.

Any idea where the problem might be and how to solve it?

Thanks in advance, Frank

SQL Server 2008 handles the conversion from UTF-8 into UCS-2 for you.

First make sure your SQL tables use the nchar or nvarchar data types for the columns. Then you need to tell SQL Server you're sending in Unicode data by adding an N in front of the encoded string.

INSERT INTO tblTest (test) VALUES (N'EncodedString')

From Microsoft: http://support.microsoft.com/kb/239530

See my question and solution here: How do I convert UTF-8 data from Classic asp Form post to UCS-2 for inserting into SQL Server 2008 r2?

I think you have a misunderstanding of what encodings are. An encoding is used to convert a bunch of bytes into a character string. A String does not itself have an encoding associated with it.

Internally, Strings are stored in memory as UTF-16LE bytes (which is why Windows persists in confusing everyone by calling the UTF-16LE encoding just “Unicode”). But you don't need to know that — to you, they're just strings of characters.

What your function does is:

  1. Takes a string and converts it to UTF-8 bytes.
  2. Takes those UTF-8 bytes and converts them to UTF-16LE bytes. (You could have just encoded straight to UTF-16LE instead of UTF-8 in step one.)
  3. Takes those UTF-16LE bytes and converts them back to a string. This gives you the exact same String you had in the first place!

So this function is redundant; you can actually just pass a normal String to SQL Server from .NET and not worry about it.
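
To make the three steps above concrete, here is a minimal sketch that repeats the same byte conversions without the regex clean-up and compares the result with the input. The class name, the `RoundTrip` helper and the sample title are made up for illustration; only the `System.Text.Encoding` calls come from the question.

```csharp
using System;
using System.Text;

static class RoundTripDemo
{
    // Same three steps as ConvertTitle in the question, minus the Regex.Replace calls.
    static string RoundTrip(string s)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(s);                                      // step 1: string -> UTF-8 bytes
        byte[] utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);  // step 2: UTF-8 -> UTF-16LE bytes
        return Encoding.Unicode.GetString(utf16Bytes);                                     // step 3: UTF-16LE bytes -> string
    }

    static void Main()
    {
        string title = "Gda\u0144sk \u00C9cole"; // hypothetical title containing ń and É
        Console.WriteLine(RoundTrip(title) == title); // prints True: the conversion changes nothing
    }
}
```

Since the string comes back unchanged, the ń is still intact when it leaves this function, so the loss has to happen further along, on the way into the database.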

The bit with the backslashes does do something, presumably application-specific; I don't understand what it's for. But nothing in that function will cause Windows to flatten characters like ń to n.

What /will/ cause that kind of flattening is when you try to put characters that aren't in the database's own encoding in the database. Presumably é is OK because that character is in your default encoding of cp1252 Western European, but ń is not so it gets mangled.
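
A rough way to check which characters fit into cp1252 from .NET is sketched below. It assumes the .NET Framework, where code page 1252 is available out of the box (on .NET Core and later you would first have to register `CodePagesEncodingProvider.Instance`). The `CanEncode` helper and the class name are hypothetical. Note that this only tests membership in the code page; the flattening of ń to n is the database's own best-fit substitution, which this sketch does not reproduce.

```csharp
using System;
using System.Text;

static class CodePageCheck
{
    // Hypothetical helper: true if every character of s exists in the given encoding.
    static bool CanEncode(Encoding enc, string s)
    {
        try { enc.GetBytes(s); return true; }
        catch (EncoderFallbackException) { return false; }
    }

    static void Main()
    {
        // Ask for exceptions instead of the default silent substitution.
        Encoding cp1252 = Encoding.GetEncoding(
            1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

        Console.WriteLine(CanEncode(cp1252, "\u00E9")); // é is part of cp1252
        Console.WriteLine(CanEncode(cp1252, "\u0144")); // ń is not part of cp1252
    }
}
```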

SQL Server does use "UCS-2" (really UTF-16LE again) to store Unicode strings, but you have to tell it to, typically by using a NATIONAL CHARACTER (NCHAR/NVARCHAR) column type instead of plain CHAR/VARCHAR.
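
On the .NET side, one way to pass the string through untouched is a parameterized command with an NVARCHAR parameter, which also avoids having to write the N prefix into a literal. This is only a sketch under assumptions: the table and column names are borrowed from the `tblTest (test)` example in the other answer, the column is assumed to be NVARCHAR(255), and the connection string, class and method names are placeholders.

```csharp
using System.Data;
using System.Data.SqlClient;

static class TitleWriter
{
    // connectionString is a placeholder; tblTest(test) is assumed to be an NVARCHAR(255) column.
    public static void InsertTitle(string connectionString, string title)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("INSERT INTO tblTest (test) VALUES (@test)", conn))
        {
            // SqlDbType.NVarChar sends the value as Unicode, so it is not squeezed through the code page.
            cmd.Parameters.Add("@test", SqlDbType.NVarChar, 255).Value = title;
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```

If the target column is still plain VARCHAR, the value will be converted to the column's code page on storage regardless of how it is sent, so the column type has to change as well.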

We were also very confused about encoding. Here is a useful page that explains it. The answer to the following SO question will also help to explain it:

In C# String/Character Encoding what is the difference between GetBytes(), GetString() and Convert()?

For future readers using newer versions: note that SQL Server 2016 supports UTF-8 in its bcp utility.

