Is there a more elegant way to change Unicode to ASCII?

I've often seen the problem where you have some obscure Unicode character that looks somewhat like a certain ASCII character and needs to be converted at run time for whatever reason.

In this case I am trying to export to CSV. Having already used a nasty fix for dash, em dash, en dash and hbar (horizontal bar), I have just received a new request for ' ` '. Aside from another nasty fix, is there a better way to do this?

Here's what I have at the moment...

        formattedString = formattedString.Replace(char.ConvertFromUtf32(8211), "-");
        formattedString = formattedString.Replace(char.ConvertFromUtf32(8212), "-");
        formattedString = formattedString.Replace(char.ConvertFromUtf32(8213), "-");

Any ideas?

It's a rather inelegant problem, so no method will really be deeply elegant.

Still, we can certainly improve things. Which approach works best will depend on the number of replacements that need to be made (and on the size of the string to change, though it's often best to assume this either is or could be quite large).

With a single character to replace, the approach you use so far, .Replace, is the better one, though I would replace char.ConvertFromUtf32(8211) with "\u2013". The effect on performance is negligible but it's more readable, since it's more usual to refer to that character in hexadecimal, as in U+2013, than in decimal (of course char.ConvertFromUtf32(0x2013) would have the same advantage, but no advantage over just using the escape notation). (One could also just put '–' straight into the code. That is more readable in some cases, but less so here, where it looks much the same as ‒, — or - to the reader.)

I'd also swap the string replace for the marginally faster character replace (in this case at least, where you are replacing a single char with a single char).

Applying this approach to your code, it becomes:

formattedString = formattedString.Replace('\u2013', '-');
formattedString = formattedString.Replace('\u2014', '-');
formattedString = formattedString.Replace('\u2015', '-');

Even with as few as 3 replacements, this is likely to be less efficient than doing all such replacements in one pass (I'm not going to test how long formattedString would need to be for this to hold; above a certain number of replacements a single pass becomes more efficient even for strings of only a few characters). One approach is:

StringBuilder sb = new StringBuilder(formattedString.Length); // we know the required capacity, so we initialise with it
foreach(char c in formattedString)
  switch(c)
  {
    case '\u2013': case '\u2014': case '\u2015':
      sb.Append('-');
      break; // C# does not allow falling through into the next case
    default:
      sb.Append(c);
      break;
  }
formattedString = sb.ToString();

(Another possibility is to check if (int)c >= 0x2013 && (int)c <= 0x2015, but the reduction in the number of branches is small, and irrelevant if most of the characters you look for aren't numerically close to each other.)

There are various variants of this (e.g. if formattedString is going to be output to a stream at some point, it may be best to write each final character as it is obtained, rather than buffering again).
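As a rough illustration of the stream variant mentioned above, the same switch can write straight to a TextWriter instead of filling a StringBuilder. This is only a sketch: the class name DashFolder and the method WriteFolded are made up for the example, and it handles just the three dash characters from the question.

```csharp
using System;
using System.IO;

static class DashFolder
{
    // Hypothetical helper (not from the original answer): writes `input` to
    // `output`, replacing three dash-like characters with '-' as each
    // character is read, so no second buffer is needed.
    public static void WriteFolded(string input, TextWriter output)
    {
        foreach (char c in input)
        {
            switch (c)
            {
                case '\u2013': // en dash
                case '\u2014': // em dash
                case '\u2015': // horizontal bar
                    output.Write('-');
                    break;
                default:
                    output.Write(c);
                    break;
            }
        }
    }
}
```

For a CSV export the writer would typically be the StreamWriter already used for the file; for a quick check, a StringWriter works just as well.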

Note that this approach doesn't deal with multi-char strings in your search, but it can produce strings in your output, e.g. we could include:

case 'ß':
  sb.Append("ss");
  break;

Now, this is more efficient than the previous approach, but it still becomes unwieldy after a certain number of replacement cases. It also involves many branches, which have their own performance issues.

Let's consider for a moment the opposite problem. Say you wanted to convert characters from a source that was only in the US-ASCII range. You would have only 128 possible characters, so your approach could be:

char[] replacements = {/*list of 128 replacement characters, indexed by the source character's value*/};
StringBuilder sb = new StringBuilder(formattedString.Length);
foreach(char c in formattedString)
  sb.Append(replacements[(int)c]);
formattedString = sb.ToString();

Now, this isn't practical with Unicode, which has over 109,000 assigned characters in a range going from 0 to 1114111. However, chances are the characters you care about are not only far fewer than that (and if you really did care about that many cases, you'd want the approach given just above) but also fall within a relatively restricted block.

Consider also the case where you don't especially care about any surrogates (we'll come to those later). Well, most characters you just don't care about, so let's consider this:

char[] unchanged = new char[128];
for(int i = 0; i != 128; ++i)
  unchanged[i] = (char)i;
char[] error = new string('\uFFFD', 128).ToCharArray();
// U+2013..U+2015 sit at offsets 19..21 of block 64 (0x2013 = 64 * 128 + 19)
char[] block0 = (new string('\uFFFD', 19) + "---" + new string('\uFFFD', 106)).ToCharArray();

char[][] blocks = new char[8704][];
for(int i = 1; i != 8704; ++i)
  blocks[i] = error;
blocks[0] = unchanged;
blocks[64] = block0;

/* the above need only happen once, so it could be done with static members of a helper class that are initialised in a static constructor*/

StringBuilder sb = new StringBuilder(formattedString.Length);
foreach(char c in formattedString)
{
  int cAsI = (int)c;
  sb.Append(blocks[cAsI / 128][cAsI % 128]);
}
string ret = sb.ToString();
if(ret.IndexOf('\uFFFD') != -1)
    throw new ArgumentException("Unconvertable character");
formattedString = ret;

Whether it's better to test for an unconvertible character in one go at the end (as above) or on each conversion depends on how likely this is to happen. It's obviously even better if you can be sure (due to knowledge of your data) that it won't happen, and can remove that check, but you have to be really sure.

The advantage here is that while we are using a look-up method, we are only taking up 384 characters' worth of memory to hold the look-up (plus some more for the array overhead) rather than 109,000 characters' worth. The best size for the blocks varies according to your data (that is, what replacements you want to make), but the assumption that there will be blocks that are identical to each other tends to hold.
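A minimal sketch of how the block table might be packaged as static members of a helper class, as the comment in the code suggests. The names BlockFolder and Fold are invented for the example. Two choices here differ from the code above and are assumptions of the sketch: it allocates only 512 blocks, because a char is a UTF-16 code unit and never exceeds 0xFFFF, and it checks for an unconvertible character on each conversion rather than once at the end.

```csharp
using System;
using System.Text;

static class BlockFolder
{
    private const int BlockSize = 128;
    private const char Unmapped = '\uFFFD';

    // Built once, when the class is first used.
    private static readonly char[][] Blocks = BuildBlocks();

    private static char[][] BuildBlocks()
    {
        // 65536 / 128 = 512 blocks cover every possible char value.
        var blocks = new char[65536 / BlockSize][];

        // Every block without a mapping shares this single "error" array.
        char[] error = new string(Unmapped, BlockSize).ToCharArray();
        for (int i = 0; i < blocks.Length; i++)
            blocks[i] = error;

        // Block 0 (US-ASCII) maps each character to itself.
        char[] ascii = new char[BlockSize];
        for (int i = 0; i < BlockSize; i++)
            ascii[i] = (char)i;
        blocks[0] = ascii;

        // The block holding U+2013..U+2015 maps those three to '-'.
        char[] punctuation = (char[])error.Clone();
        foreach (char dash in "\u2013\u2014\u2015")
            punctuation[dash % BlockSize] = '-';
        blocks['\u2013' / BlockSize] = punctuation;

        return blocks;
    }

    public static string Fold(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            char mapped = Blocks[c / BlockSize][c % BlockSize];
            if (mapped == Unmapped)
                throw new ArgumentException("Unconvertible character U+" + ((int)c).ToString("X4"));
            sb.Append(mapped);
        }
        return sb.ToString();
    }
}
```

Calling BlockFolder.Fold(formattedString) would then replace the inline loop; adding another mapped block means cloning the error array and filling in a few offsets.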

Now, finally, what if you care about a character in the "astral planes", which are represented as surrogate pairs in the UTF-16 used internally in .NET, or if you care about replacing some multi-char strings in a particular way?

In this case, you are probably going to have to, at the very least, read a character or more ahead in your switch (if using the block method for most cases, you can use an unconvertible case to signal that such work is required). In such a case, it might well be worth converting to and then back from US-ASCII with System.Text.Encoding and a custom implementation of EncoderFallback and EncoderFallbackBuffer, and handling it there. This means that most of the conversion (the obvious cases) will be done for you, while your implementation need deal only with the special cases.
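The following is one possible shape for such a fallback, offered as a sketch rather than a finished implementation. The class names AsciiFoldingFallback, FoldingBuffer and AsciiFolder, the contents of the map, and the choice of "?" for anything unmapped (including surrogate pairs) are all assumptions made for the example; only EncoderFallback, EncoderFallbackBuffer and Encoding.GetEncoding come from the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

sealed class AsciiFoldingFallback : EncoderFallback
{
    // Illustrative replacements only; extend as needed.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { '\u2013', "-" }, { '\u2014', "-" }, { '\u2015', "-" }, { '\u00DF', "ss" },
    };

    // Longest replacement string produced for one input character.
    public override int MaxCharCount { get { return 2; } }

    public override EncoderFallbackBuffer CreateFallbackBuffer()
    {
        return new FoldingBuffer();
    }

    private sealed class FoldingBuffer : EncoderFallbackBuffer
    {
        private string _pending = "";
        private int _pos;

        // Called by the encoder for each char it cannot encode as US-ASCII.
        public override bool Fallback(char charUnknown, int index)
        {
            string replacement;
            _pending = Map.TryGetValue(charUnknown, out replacement) ? replacement : "?";
            _pos = 0;
            return true;
        }

        // Called for a surrogate pair; this sketch just emits "?".
        public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
        {
            _pending = "?";
            _pos = 0;
            return true;
        }

        public override char GetNextChar()
        {
            return _pos < _pending.Length ? _pending[_pos++] : '\0';
        }

        public override bool MovePrevious()
        {
            if (_pos == 0) return false;
            _pos--;
            return true;
        }

        public override int Remaining { get { return _pending.Length - _pos; } }

        public override void Reset()
        {
            _pending = "";
            _pos = 0;
        }
    }
}

static class AsciiFolder
{
    // Round-trips through US-ASCII so the fallback handles the special cases.
    public static string Fold(string input)
    {
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii", new AsciiFoldingFallback(), DecoderFallback.ExceptionFallback);
        return ascii.GetString(ascii.GetBytes(input));
    }
}
```

With this in place, AsciiFolder.Fold(formattedString) leaves plain ASCII untouched and consults the map only for characters the encoder cannot represent.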

You could maintain a lookup table that maps your problem characters to replacement characters. For efficiency you can work on a character array to avoid the intermediate string churn that would result from repeated calls to string.Replace.

For example:

var lookup = new Dictionary<char, char>
{
    { '`',  '-' },
    { 'இ', '-' },
    //next pair, etc, etc
};

var input = "blah இ blah ` blah";

char r; // 'var' cannot be used without an initialiser

var result = input.Select(c => lookup.TryGetValue(c, out r) ? r : c);

string output = new string(result.ToArray());

Or if you want blanket treatment of non-ASCII range characters:

string output = new string(input.Select(c => c <= 127 ? c : '-').ToArray());

Unfortunately, given that you're doing a bunch of specific transforms within your data, you will likely need to do these via replacements.

That being said, you could make a few improvements.

  1. If this is common, and the strings are long, storing these in a StringBuilder instead of a string would allow in-place replacements of the values, which could potentially improve things.
  2. You could store the conversion characters, both from and to, in a Dictionary or other structure, and perform these operations in a simple loop.
  3. You could load both the "from" and "to" characters at runtime from a configuration file, instead of having to hard-code every transformation operation. Later, when more of these are requested, you wouldn't need to alter your code; it could be done via configuration.
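The three points above could be combined roughly as follows. This is a sketch under stated assumptions: the class ConfiguredReplacer and its methods are hypothetical, and so is the configuration format, in which each line holds a hexadecimal character code, an equals sign, and the replacement text (for example 2013=-).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class ConfiguredReplacer
{
    // Parses lines such as "2013=-" into a from/to map.
    // The line format is an assumption made for this sketch.
    public static Dictionary<char, string> Parse(IEnumerable<string> lines)
    {
        var map = new Dictionary<char, string>();
        foreach (string line in lines)
        {
            int eq = line.IndexOf('=');
            if (eq <= 0) continue; // skip blank or malformed lines

            int code = int.Parse(line.Substring(0, eq).Trim(),
                                 NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            map[(char)code] = line.Substring(eq + 1);
        }
        return map;
    }

    // Applies every configured replacement to the StringBuilder in place.
    public static void Apply(StringBuilder text, Dictionary<char, string> map)
    {
        foreach (KeyValuePair<char, string> pair in map)
            text.Replace(pair.Key.ToString(), pair.Value);
    }
}
```

The lines would normally come from something like File.ReadAllLines on a settings file of your choosing; a new request such as the backtick then becomes one more line in that file rather than a code change.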

If they are all replaced with the same string:

formattedString = string.Join("-", formattedString.Split('\u2013', '\u2014', '\u2015'));

or

foreach (char c in "\u2013\u2014\u2015") 
    formattedString = formattedString.Replace(c, '-');
