
Fix string encoding issues

Does anyone know of a .NET library (preferably a NuGet package) that I can use to fix strings that are 'messed up' because of encoding issues?

I have Excel* files supplied by third parties that contain strings like:

Telefónica UK Limited

Serviços de Comunicações e Multimédia

These entries are simply user error (e.g. someone copy/pasted incorrectly), because elsewhere in the same file the same entries are correct:

Telefónica UK Limited

Serviços de Comunicações e Multimédia

So I was wondering if there is a library/package/something that takes a string and fixes "common errors" like çõ → çõ and ó → ó. I understand that this won't be 100% foolproof and may result in some false negatives, but it would sure be nice to have a field-tested library to help me clean up my data a bit. Ideally it would 'autodetect' the issue(s) and 'autofix' them, as I won't always be able to tell what the source encoding (and destination encoding) was at the time the mistake was made.

* The file type is not very relevant; I may have text from other parties in other file formats that has the same issue...

My best advice is to start with a list of special characters that are used in the language in question.

I assume you're just dealing with Portuguese or other European languages with just a handful of non-US-ASCII characters.

I also assume you know what the bad encoding was in the first place (i.e. the code page), and that it was always the same.

(If you can't assume these things, then it's a bigger problem.)

Then deliberately mis-encode each of these characters (encode it with the good encoding, decode it with the bad one) and look for the results in your source text. If any are found, you can treat the text as badly encoded.

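using System.Linq;
using System.Text;

var specialCharacters = "çõéó";
var goodEncoding = Encoding.UTF8;
var badEncoding = Encoding.GetEncoding(28591); // ISO-8859-1 (Latin-1)

// What each special character looks like after its UTF-8 bytes are read as Latin-1, e.g. "ç" becomes "Ã§".
var badStrings = specialCharacters
    .Select(c => badEncoding.GetString(goodEncoding.GetBytes(c.ToString())))
    .ToList();

var sourceText = "ServiÃ§os de ComunicaÃ§Ãµes e MultimÃ©dia";
if (badStrings.Any(s => sourceText.Contains(s)))
{
    // Reverse the mistake: recover the original bytes, then decode them as UTF-8.
    sourceText = goodEncoding.GetString(badEncoding.GetBytes(sourceText));
}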
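The check above relies on a hand-picked character list. As a rough sketch of an alternative (not a field-tested library), the round trip itself can serve as the test: re-encode the text as Latin-1, decode the bytes strictly as UTF-8, and keep the result only if neither step fails. The class and method names `MojibakeRepair` and `TryRepair` are hypothetical, and the sketch still assumes the bad encoding was ISO-8859-1.

```csharp
using System;
using System.Text;

// Mis-encoded input: the round trip is clean, so the repaired text is returned
// ("Serviços de Comunicações e Multimédia").
Console.WriteLine(MojibakeRepair.TryRepair("ServiÃ§os de ComunicaÃ§Ãµes e MultimÃ©dia"));

// Already-correct input: its Latin-1 bytes are not valid UTF-8, so it is returned unchanged.
Console.WriteLine(MojibakeRepair.TryRepair("Telefónica UK Limited"));

// Hypothetical helper, not part of any library.
static class MojibakeRepair
{
    // Strict variants: throw instead of silently substituting '?' or U+FFFD.
    static readonly Encoding StrictLatin1 = Encoding.GetEncoding(
        28591, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
    static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // Returns the repaired text, or the input unchanged when the round trip is not clean.
    public static string TryRepair(string text)
    {
        try
        {
            return StrictUtf8.GetString(StrictLatin1.GetBytes(text));
        }
        catch (ArgumentException) // base class of EncoderFallbackException and DecoderFallbackException
        {
            return text;
        }
    }
}
```

Plain ASCII text passes through unchanged. Text that was mangled via Windows-1252 and contains characters such as € or ™ cannot be encoded as Latin-1, so this sketch leaves it untouched rather than guessing.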

The first step in fixing a bad encoding is to find which encoding the text was wrongly decoded with; often this is not obvious.

So, start with a bit of text that is mis-encoded, and the corrected version of the same text. Here my badly encoded text ends with Ã¤ rather than ä:

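using System;
using System.Text;

// Note: on .NET Core / .NET 5+, legacy code pages such as Windows-1252 are only available
// after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
var name = "ViistoperÃ¤";   // the mis-encoded text
var target = "Viistoperä";  // what it should have been
var encs = Encoding.GetEncodings();
foreach (var encodingType in encs)
{
    // Turn the text back into bytes with the candidate encoding, then read those bytes as UTF-8.
    var raw = Encoding.GetEncoding(encodingType.CodePage).GetBytes(name);
    var output = Encoding.UTF8.GetString(raw);
    if (output == target)
    {
        Console.WriteLine("{0},{1},{2}", encodingType.DisplayName, encodingType.CodePage, output);
    }
}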

This will output a number of candidate encodings, and you can pick the most relevant one. Windows-1252 is a better candidate than Turkish in this case.
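Once a candidate has been picked, the same reversal can be applied to the rest of the data. The following is only a sketch under assumptions: the chosen code page is 1252, and `CodePageRepair.Repair` is a hypothetical helper name. It leaves a value unchanged whenever the reversal does not produce valid UTF-8, so strings that were already correct are not damaged.

```csharp
using System;
using System.Text;

// Assumption: on .NET Core / .NET 5+ this registration makes legacy code pages
// such as 1252 available. On .NET Framework, drop this line.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

string[] values = { "ViistoperÃ¤", "TelefÃ³nica UK Limited", "Viistoperä" };
foreach (var value in values)
{
    // The first two values are repaired; the third is already correct and stays as it is.
    Console.WriteLine(CodePageRepair.Repair(value, 1252));
}

// Hypothetical helper, not part of any library.
static class CodePageRepair
{
    static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);

    public static string Repair(string text, int wrongCodePage)
    {
        // Strict fallbacks: throw rather than substitute characters that cannot be converted.
        var wrong = Encoding.GetEncoding(
            wrongCodePage, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        try
        {
            return StrictUtf8.GetString(wrong.GetBytes(text));
        }
        catch (ArgumentException) // encoder or decoder fallback exception
        {
            return text;
        }
    }
}
```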
