Insert UTF-8 data into SQL Server 2008

I have an issue with encoding. I want to put data from a UTF-8-encoded file into a SQL Server 2008 database. SQL Server only features UCS-2 encoding, so I decided to explicitly convert the retrieved data.

// connect to page file
_fsPage = new FileStream(mySettings.filePage, FileMode.Open, FileAccess.Read);
_streamPage = new StreamReader(_fsPage, System.Text.Encoding.UTF8);

Here's the conversion routine for the data:

private string ConvertTitle(string title)
{
  string utf8_String = Regex.Replace(Regex.Replace(title, @"\\.", _myEvaluator), @"(?<=[^\\])_", " ");
  byte[] utf8_bytes = System.Text.Encoding.UTF8.GetBytes(utf8_String);
  byte[] ucs2_bytes = System.Text.Encoding.Convert(System.Text.Encoding.UTF8, System.Text.Encoding.Unicode, utf8_bytes);
  string ucs2_String = System.Text.Encoding.Unicode.GetString(ucs2_bytes);

  return ucs2_String;
}

When stepping through the code for critical titles, the variable watch shows the correct characters for both the UTF-8 and the UCS-2 string. But in the database the result is partially wrong. Some special characters are saved correctly, others are not.

  • Wrong: ń becomes an n
  • Right: É and é, for example, are inserted correctly.

Any idea where the problem might be and how to solve it?

Thanks in advance, Frank

SQL Server 2008 handles the conversion from UTF-8 into UCS-2 for you.

First make sure your SQL tables use the nchar or nvarchar data types for the columns. Then you need to tell SQL Server that you're sending Unicode data by adding an N in front of the encoded string literal.

INSERT INTO tblTest (test) VALUES (N'EncodedString')

from Microsoft http://support.microsoft.com/kb/239530

See my question and solution here: How do I convert UTF-8 data from Classic asp Form post to UCS-2 for inserting into SQL Server 2008 r2?

I think you have a misunderstanding of what encodings are. An encoding is used to convert a bunch of bytes into a character string, and back again. A String does not itself have an encoding associated with it.

Internally, Strings are stored in memory as UTF-16LE bytes (which is why Windows persists in confusing everyone by calling the UTF-16LE encoding just “Unicode”). But you don't need to know that; to you, they're just strings of characters.
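To make the bytes-versus-characters distinction concrete, here is a minimal sketch that prints the bytes two encodings produce for the same one-character string. The class name StringBytesDemo and the helper Hex are made up for illustration; only the System.Text.Encoding and BitConverter calls are standard .NET.

```csharp
using System;
using System.Text;

public static class StringBytesDemo
{
    // Hypothetical helper: hex dump of the bytes an encoding produces for a string.
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        string s = "\u0144"; // the character ń, written as an escape to avoid source-file encoding issues

        Console.WriteLine(s.Length);                 // 1: one character, whatever the encoding
        Console.WriteLine(Hex(Encoding.UTF8, s));    // C5-84: two bytes in UTF-8
        Console.WriteLine(Hex(Encoding.Unicode, s)); // 44-01: two bytes in UTF-16LE ("Unicode" in .NET)
    }
}
```

The string is the same in both cases; only the byte representation chosen at the moment of encoding differs.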

What your function does is:

  1. Takes a string and converts it to UTF-8 bytes.
  2. Takes those UTF-8 bytes and converts them to UTF-16LE bytes. (You could have just encoded straight to UTF-16LE instead of UTF-8 in step one.)
  3. Takes those UTF-16LE bytes and converts them back to a string. This gives you the exact same String you had in the first place!
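The three steps above can be checked with a short sketch. RoundTrip below repeats the same Encoding calls as ConvertTitle but leaves out the regex handling; the class name, method name and sample title are made up for illustration.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Same three encoding steps as ConvertTitle, without the regex replacements.
    public static string RoundTrip(string input)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(input);                                  // step 1
        byte[] utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);  // step 2
        return Encoding.Unicode.GetString(utf16Bytes);                                     // step 3
    }

    public static void Main()
    {
        string title = "Pozna\u0144 caf\u00E9"; // contains both ń and é

        // The round trip yields a string equal to the input.
        Console.WriteLine(RoundTrip(title) == title); // True
    }
}
```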

So this function is redundant; you can actually just pass a normal String to SQL Server from .NET and not worry about it.

The bit with the backslashes does do something, presumably application-specific; I don't understand what it's for. But nothing in that function will cause Windows to flatten characters like ń to n.

What /will/ cause that kind of flattening is trying to put characters into the database that aren't in the database's own encoding. Presumably é is OK because that character is in your default encoding of cp1252 Western European, but ń is not, so it gets mangled.
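A rough way to see which characters fit into a single-byte Western European code page is to encode with a strict fallback and watch for failures. This is only a sketch under stated assumptions: it uses ISO-8859-1 as a stand-in for cp1252 because that encoding is built into every .NET runtime, and like cp1252 it contains é but not ń. The names CodePageFitDemo, StrictLatin1 and FitsIn are hypothetical. SQL Server itself does not throw in this situation; as the question shows, it substitutes a lookalike character.

```csharp
using System;
using System.Text;

public static class CodePageFitDemo
{
    // ISO-8859-1 configured to throw instead of silently replacing unmappable characters.
    public static Encoding StrictLatin1()
    {
        return Encoding.GetEncoding(
            "iso-8859-1",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
    }

    // True if every character of text can be represented in the given strict encoding.
    public static bool FitsIn(Encoding strictEncoding, string text)
    {
        try
        {
            strictEncoding.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Encoding latin1 = StrictLatin1();
        Console.WriteLine(FitsIn(latin1, "\u00E9")); // True:  é exists in the code page
        Console.WriteLine(FitsIn(latin1, "\u0144")); // False: ń does not
    }
}
```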

SQL Server does use 'UCS2' (really UTF-16LE again) to store Unicode strings, but you have to tell it to, typically by using a NATIONAL CHARACTER (NCHAR/NVARCHAR) column type instead of plain CHAR.

We were also very confused about encoding. Here is a useful page that explains it. The answer to the following SO question also helps to explain it:

In C# String/Character Encoding what is the difference between GetBytes(), GetString() and Convert()?

For future readers using newer versions, note that SQL Server 2016 supports UTF-8 in its bcp utility.
