
How can I convert values stored in ANSI (Windows 1252) in a database to UTF-8

When I open a legacy database in Sqlite Browser, the text is already displayed wrong. The only encodings I can set are UTF-8 and UTF-16.
[Image: Sqlite Browser with umlauts]

When I query the database, the encoding is already wrong in Visual Studio.
[Image: Visual Studio Locals window]

I assume the text is encoded in ANSI (Windows-1252) (confirmed in the comments). I tried converting it to UTF-8:

        var encoding = Encoding.GetEncoding(1252);
        byte[] encBytes = encoding.GetBytes(result);
        byte[] utf8Bytes = Encoding.Convert(encoding, Encoding.UTF8, encBytes);
        return Encoding.UTF8.GetString(utf8Bytes);

but now the question mark symbol is just a plain question mark.
[Image: still wrong]

Somehow, the external legacy app displays it correctly, so there appears to be a way. But I'm not sure what I can try next.

I had the same problem once.

Jon Skeet answered it here:

Basically, take the string, get its bytes using the wrong encoding it was decoded with, then decode those bytes with the encoding they were really in:

string broken = "Brokers MÃ©xico, Intermediario de Aseguro,S.A."; // Get text from database
byte[] encoded = Encoding.GetEncoding(28591).GetBytes(broken);
string corrected = Encoding.UTF8.GetString(encoded);

So yours should simply be

string broken = "Whatever";
byte[] encoded = Encoding.GetEncoding(1252).GetBytes(broken);
string corrected = Encoding.UTF8.GetString(encoded);
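
To see why this round trip works, here is a minimal console sketch. It assumes the damage happened the way the example above implies: UTF-8 bytes were decoded with a single-byte code page. The class and method names are made up for illustration, and ISO-8859-1 (28591) is used because it maps every byte to exactly one character.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // ISO-8859-1 (code page 28591) maps bytes 0x00-0xFF one-to-one onto U+0000-U+00FF.
    private static readonly Encoding Latin1 = Encoding.GetEncoding(28591);

    // Simulates the damage: UTF-8 bytes decoded with the wrong single-byte encoding.
    public static string Break(string text)
    {
        return Latin1.GetString(Encoding.UTF8.GetBytes(text));
    }

    // Reverses it: recover the original bytes, then decode them as UTF-8.
    public static string Repair(string broken)
    {
        return Encoding.UTF8.GetString(Latin1.GetBytes(broken));
    }

    public static void Main()
    {
        string broken = Break("México");
        Console.WriteLine(broken);          // the string "MÃ©xico"
        Console.WriteLine(Repair(broken));  // the string "México"
    }
}
```

This only works while the wrong decoding step was lossless. If bytes were already replaced by `?` or U+FFFD when the string was created, there is nothing left to recover.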

Basically, now that you know that the re-conversion program is correct, I'd play around with the encodings mentioned here:
https://msdn.microsoft.com/en-us/library/system.text.encodinginfo.getencoding(v=vs.110).aspx
(just write a program that tests all the likely possibilities listed there and see which pair yields a match...)

If you know the source text, you could even perform the checking automagically:

using System;
using System.Text;
using System.Windows.Forms;

public partial class Form1 : Form
{
    public System.Data.DataTable dt;

    public Form1()
    {
        InitializeComponent();
    }




    private void btnTest_Click(object sender, EventArgs e)
    {
        dt = new System.Data.DataTable();

        string correct = "Brokers México, Intermediario de Aseguro,S.A.";

        string broken = "Brokers MÃ©xico, Intermediario de Aseguro,S.A."; // Get text from database

        dt.Columns.Add("SourceEncoding", typeof(string));
        dt.Columns.Add("TargetEncoding", typeof(string));
        dt.Columns.Add("Result", typeof(string));
        dt.Columns.Add("SourceEncodingName", typeof(string));
        dt.Columns.Add("TargetEncodingName", typeof(string));

        // For reference
        // https://msdn.microsoft.com/en-us/library/system.text.encodinginfo.getencoding(v=vs.110).aspx
        int[] encs = new int[] { 
             20127 // US-ASCII
            ,28591 // iso-8859-1 Western European (ISO)       
            ,28592 // iso-8859-2 Central European (ISO)       
            ,28593 // iso-8859-3 Latin 3 (ISO)
            ,28594 // iso-8859-4 Baltic (ISO)
            ,28595 // iso-8859-5 Cyrillic (ISO)
            ,28596 // iso-8859-6 Arabic (ISO)
            ,28597 // iso-8859-7 Greek (ISO)
            ,28598 // iso-8859-8 Hebrew (ISO-Visual)          
            ,28599 // iso-8859-9 Turkish (ISO)
            ,28603 // iso-8859-13 Estonian (ISO)
            ,28605 // iso-8859-15 Latin 9 (ISO)   

            ,1250 // windows-1250 Central European (Windows)      
            ,1251 // windows-1251 Cyrillic (Windows)             
            ,1252 // Windows-1252 Western European (Windows)      
            ,1253 // windows-1253 Greek (Windows)                
            ,1254 // windows-1254 Turkish (Windows)              
            ,1255 // windows-1255 Hebrew (Windows)               
            ,1256 // windows-1256 Arabic (Windows)               
            ,1257 // windows-1257 Baltic (Windows)               
            ,1258 // windows-1258 Vietnamese (Windows)

            ,20866 // Cyrillic (KOI8-R)
            ,21866 // Cyrillic (KOI8-U)  

            ,65000 // UTF-7
            ,65001 // UTF-8
            ,1200 // UTF-16
            ,1201 // Unicode (Big-Endian)    

            ,12000 // UTF-32
            ,12001 // UTF-32BE (UTF-32 Big-Endian) 
        };


        for (int i = 0; i < encs.Length; ++i)
        {

            for (int j = 0; j < encs.Length; ++j)
            {
                System.Data.DataRow dr = dt.NewRow();

                dr["SourceEncoding"] = encs[i];
                dr["TargetEncoding"] = encs[j];


                System.Text.Encoding enci = Encoding.GetEncoding(encs[i]);
                System.Text.Encoding encj = Encoding.GetEncoding(encs[j]);

                byte[] encoded = enci.GetBytes(broken);
                string corrected = encj.GetString(encoded);

                dr["Result"] = corrected;

                dr["SourceEncodingName"] = enci.BodyName;
                dr["TargetEncodingName"] = encj.BodyName;


                if (StringComparer.InvariantCultureIgnoreCase.Equals(correct, corrected))
                    dt.Rows.Add(dr);
            }

        }

        this.dataGridView1.DataSource = dt;
    }
}

Or, even more thoroughly, just test all encodings:

private void btnTestAll_Click(object sender, EventArgs e)
{
    dt = new System.Data.DataTable();

    string correct = "Brokers México, Intermediario de Aseguro,S.A.";

    string broken = "Brokers MÃ©xico, Intermediario de Aseguro,S.A."; // Get text from database

    dt.Columns.Add("SourceEncoding", typeof(string));
    dt.Columns.Add("TargetEncoding", typeof(string));
    dt.Columns.Add("Result", typeof(string));
    dt.Columns.Add("SourceEncodingName", typeof(string));
    dt.Columns.Add("TargetEncodingName", typeof(string));



    System.Text.EncodingInfo[] encs = System.Text.Encoding.GetEncodings();

    for (int i = 0; i < encs.Length; ++i)
    {

        for (int j = 0; j < encs.Length; ++j)
        {
            System.Data.DataRow dr = dt.NewRow();

            dr["SourceEncoding"] = encs[i].CodePage;
            dr["TargetEncoding"] = encs[j].CodePage;


            System.Text.Encoding enci = System.Text.Encoding.GetEncoding(encs[i].CodePage);
            System.Text.Encoding encj = System.Text.Encoding.GetEncoding(encs[j].CodePage);

            byte[] encoded = enci.GetBytes(broken);
            string corrected = encj.GetString(encoded);

            dr["Result"] = corrected;

            dr["SourceEncodingName"] = enci.BodyName;
            dr["TargetEncodingName"] = encj.BodyName;


            if (StringComparer.InvariantCultureIgnoreCase.Equals(correct, corrected))
                dt.Rows.Add(dr);
        }

    }

    this.dataGridView1.DataSource = dt;
}

You can download the result here:

It's strange: it looks like you can get from German/ANSI (or ISO-8859-1) to ASCII, but there is NO WAY to convert it back (information loss)...

public static string lol()
{
    string source = "Alu-Dreieckstütze";

    // System.Text.Encoding encSource = System.Text.Encoding.Default;
    System.Text.Encoding encSource = System.Text.Encoding.GetEncoding(28591);
    System.Text.Encoding encTarget = System.Text.Encoding.ASCII;

    byte[] encoded = encSource.GetBytes(source);
    string broken = encTarget.GetString(encoded);

    return broken;
}
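
A minimal sketch of why that direction is lossy, assuming the standard .NET replacement fallbacks: a byte above 0x7F has no ASCII meaning and is decoded as `?`, and a lone 0xFC is not valid UTF-8 and is decoded as U+FFFD. Once that has happened, the original byte cannot be reconstructed from the string. The class name is hypothetical.

```csharp
using System;
using System.Text;

public static class LossyDecodeDemo
{
    // "stütze" as Windows-1252 / ISO-8859-1 bytes; 0xFC is 'ü'.
    public static readonly byte[] Raw = { 0x73, 0x74, 0xFC, 0x74, 0x7A, 0x65 };

    public static void Main()
    {
        string asAscii = Encoding.ASCII.GetString(Raw);
        string asUtf8 = Encoding.UTF8.GetString(Raw);
        string asLatin1 = Encoding.GetEncoding(28591).GetString(Raw);

        Console.WriteLine(asAscii);         // st?tze
        Console.WriteLine((int)asUtf8[2]);  // 65533, i.e. U+FFFD
        Console.WriteLine(asLatin1);        // the string "stütze"
    }
}
```

This is probably what happened in the question: the value arrived as a string that had already been decoded as UTF-8, so the umlaut was U+FFFD before any conversion code ran, and encoding U+FFFD to Windows-1252 yields a plain `?`.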

The funny thing is, since the legacy app displays it correctly, it can't have lost the information.

So are you sure you haven't put a wrong (or no) encoding in the Sqlite connectionString?

e.g.

  "Data Source=C:\\Users\\USERNAME\\Desktop\\location.db; Version=3; UseUTF16Encoding=True;Synchronous=Normal;New=False"; // set up the connection string

https://www.sqlite.org/c3ref/c_any.html

It looks like you can test the encoding with pragma encoding
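
A sketch of that check, assuming the Microsoft.Data.Sqlite package (with other ADO.NET providers the command pattern should be much the same). `PRAGMA encoding` reports the encoding declared for the database: `UTF-8`, `UTF-16le` or `UTF-16be`. SQLite does not validate text when it is stored, so a database can report `UTF-8` while actually holding Windows-1252 bytes. The class and method names are hypothetical.

```csharp
using System;
using Microsoft.Data.Sqlite;

public static class PragmaEncodingCheck
{
    // Returns the text encoding declared by the database behind the connection string.
    public static string GetDeclaredEncoding(string connectionString)
    {
        using (var connection = new SqliteConnection(connectionString))
        {
            connection.Open();
            using (var command = connection.CreateCommand())
            {
                command.CommandText = "PRAGMA encoding;";
                return (string)command.ExecuteScalar();
            }
        }
    }

    public static void Main()
    {
        // An in-memory database keeps the sketch self-contained;
        // point the data source at the legacy file to check that instead.
        Console.WriteLine(GetDeclaredEncoding("Data Source=:memory:"));
    }
}
```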

2 steps:
First, you read the value from the database as a byte array.
Second, you convert the byte array to a string using the 1252 encoding.
Something like this:

// Assumes the query returns the column as a BLOB, e.g. SELECT CAST(columnName AS BLOB) ...
byte[] buffer = (byte[])dataReader["columnName"];
var encoding = Encoding.GetEncoding(1252); // Windows-1252, as described in the steps above
string s = encoding.GetString(buffer);
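
The two steps can be tried end to end against an in-memory database. This is only a sketch, assuming the Microsoft.Data.Sqlite package; the table `t`, the column `name` and the class name are invented. `CAST(... AS BLOB)` makes SQLite hand back the stored bytes untouched, so the provider never tries to decode them as UTF-8.

```csharp
using System;
using System.Text;
using Microsoft.Data.Sqlite;

public static class ReadRawBytesDemo
{
    public static string ReadName()
    {
        using (var connection = new SqliteConnection("Data Source=:memory:"))
        {
            connection.Open();

            using (var setup = connection.CreateCommand())
            {
                // Store "stütze" as single-byte text (0xFC = 'ü'),
                // the way a legacy application might have written it.
                setup.CommandText =
                    "CREATE TABLE t (name TEXT);" +
                    "INSERT INTO t VALUES (CAST(x'7374FC747A65' AS TEXT));";
                setup.ExecuteNonQuery();
            }

            using (var query = connection.CreateCommand())
            {
                // Step 1: read the value as raw bytes instead of as a string.
                query.CommandText = "SELECT CAST(name AS BLOB) FROM t;";
                using (var reader = query.ExecuteReader())
                {
                    reader.Read();
                    byte[] buffer = reader.GetFieldValue<byte[]>(0);

                    // Step 2: decode the bytes. ISO-8859-1 (28591) is used here because it
                    // agrees with Windows-1252 for every byte outside 0x80-0x9F.
                    return Encoding.GetEncoding(28591).GetString(buffer);
                }
            }
        }
    }

    public static void Main()
    {
        Console.WriteLine(ReadName()); // the string "stütze"
    }
}
```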

I also have to import data from a source that encodes strings wrongly. With the Microsoft.Data.Sqlite library it's quite easy to inject a user-defined function to fix the encoding. I am also using Dapper in this example:

using (var cnn = new SqliteConnection($"Data Source={databasePath}")) {
    cnn.CreateFunction("fixencoding", (byte[] value) =>
        Encoding.GetEncoding(1252).GetString(value), isDeterministic: true);
    cnn.Open();
    return cnn.Query<Board>(Properties.Resources.GetBoards);
}

For this class:

public class Board
{
    public string Code { get; set; }
    public string Description { get; set; }
    public decimal Length { get; set; }
    public decimal Width { get; set; }
    public decimal Thickness { get; set; }
    public int Quantity { get; set; }
}

and this query (Properties.Resources.GetBoards):

SELECT
  fixencoding(CODE) AS Code,
  fixencoding(DESC) AS Description,
  LNGT AS Length,
  WIDT AS Width,
  THCK AS Thickness,
  QNTY AS Quantity
FROM
  BOARDS

If the source uses the same system locale, it's possible to use just Encoding.Default.GetString(value) instead of Encoding.GetEncoding(1252).GetString(value).
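
One caveat worth a sketch: on .NET Core and .NET 5 or later, `Encoding.Default` is always UTF-8, and code page 1252 is only available once the code-pages provider has been registered. On .NET Framework neither step is needed. The class name below is made up; the example decodes the bytes 0x80 and 0xFC as Windows-1252.

```csharp
using System;
using System.Text;

public static class CodePage1252Demo
{
    public static Encoding GetWindows1252()
    {
        // Makes the Windows code pages available to Encoding.GetEncoding.
        // On older targets this type comes from the System.Text.Encoding.CodePages package.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    public static void Main()
    {
        byte[] raw = { 0x80, 0xFC }; // the euro sign and 'ü' in Windows-1252
        Console.WriteLine(GetWindows1252().GetString(raw)); // the string "€ü"
    }
}
```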
