
Is there a more elegant way to change Unicode to ASCII?

I've seen this problem a lot: you have some obscure Unicode character which is somewhat like a certain ASCII character and needs to be converted at run time for whatever reason.

In this case I am trying to export to CSV. Having already used a nasty fix for dash, em dash, en dash and horizontal bar, I have just received a new request for ' ` '. Aside from another nasty fix, is there a better way to do this?

Here's what I have at the moment...

        formattedString = formattedString.Replace(char.ConvertFromUtf32(8211), "-");
        formattedString = formattedString.Replace(char.ConvertFromUtf32(8212), "-");
        formattedString = formattedString.Replace(char.ConvertFromUtf32(8213), "-");

Any ideas?

It's a rather inelegant problem, so no method will really be deeply elegant.

Still, we can certainly improve things. Just which approach will work best will depend on the number of changes that need to be made (and the size of the string to change, though it's often best to assume this either is or could be quite large).

At one replacement character, the approach you use so far - using .Replace - is superior, though I would replace char.ConvertFromUtf32(8211) with "\u2013". The effect on performance is negligible but it's more readable, since it's more usual to refer to that character in hexadecimal, as in U+2013, than in decimal notation (of course char.ConvertFromUtf32(0x2013) would have the same advantage there, but no advantage over just using the escape notation). (One could also just put '–' straight into the code - more readable in some cases, but less so in this one, where it looks much the same as ‒, — or - to the reader.)

I'd also replace the string replace with the marginally faster character replace (in this case at least, where you are replacing a single char with a single char).

Taking this approach to your code it becomes:

formattedString = formattedString.Replace('\u2013', '-');
formattedString = formattedString.Replace('\u2014', '-');
formattedString = formattedString.Replace('\u2015', '-');

Even with as few as 3 replacements, this is likely to be less efficient than doing all such replacements in one pass (I'm not going to test how long formattedString would need to be for this to matter; above a certain number of replacements a single pass becomes more efficient even for strings of only a few characters). One approach is:

StringBuilder sb = new StringBuilder(formattedString.Length); // the output is the same length as the input, so use that as the initial capacity
foreach(char c in formattedString)
  switch(c)
  {
    case '\u2013': case '\u2014': case '\u2015':
      sb.Append('-');
      break; // C# does not allow falling through into the next case
    default:
      sb.Append(c);
      break;
  }
formattedString = sb.ToString();

(Another possibility is to check if (int)c >= 0x2013 && (int)c <= 0x2015 but the reduction in number of branches is small, and irrelevant if most of the characters you look for aren't numerically close to each other).

There are various variants of this (e.g. if formattedString is going to be output to a stream at some point, it may be best to write each final character as it is obtained, rather than buffering again).
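A minimal sketch of that streaming variant, assuming the destination is a TextWriter (for example a StreamWriter over the CSV file). The class and method names are made up for illustration:

```csharp
using System.IO;

public static class DashWriter
{
    // Hypothetical helper: writes text to the writer, replacing the three
    // dash-like characters as it goes, with no intermediate StringBuilder.
    public static void WriteWithAsciiDashes(TextWriter writer, string text)
    {
        foreach (char c in text)
        {
            switch (c)
            {
                case '\u2013': // en dash
                case '\u2014': // em dash
                case '\u2015': // horizontal bar
                    writer.Write('-');
                    break;
                default:
                    writer.Write(c);
                    break;
            }
        }
    }
}
```

It would be called as DashWriter.WriteWithAsciiDashes(csvWriter, formattedString), where csvWriter is whatever writer the export already uses.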

Note that this approach doesn't deal with multi-char strings in your search, but can with strings in your output, eg we could include:

case 'ß':
  sb.Append("ss");
  break;

Now, this is more efficient than the previous, but still becomes unwieldy after a certain number of replacement cases. It also involves many branches, which have their own performance issues.

Let's consider for a moment the opposite problem. Say you wanted to convert characters from a source that was only in the US-ASCII range. You would have only 128 possible characters so your approach could be:

char[] replacements = {/*list of replacement characters*/};
StringBuilder sb = new StringBuilder(formattedString.Length);
foreach(char c in formattedString)
  sb.Append(replacements[(int)c]);
formattedString = sb.ToString();
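To make the table idea concrete, here is one way it might be filled in. This is only a sketch: the two mappings are arbitrary examples, and the class name is invented.

```csharp
using System.Text;

public static class AsciiTable
{
    // One entry per US-ASCII code point, built once.
    private static readonly char[] Replacements = BuildTable();

    private static char[] BuildTable()
    {
        var table = new char[128];
        for (int i = 0; i != 128; ++i)
            table[i] = (char)i;   // default: leave the character unchanged
        table['`'] = '\'';        // illustrative mapping only
        table['\t'] = ' ';        // illustrative mapping only
        return table;
    }

    // Assumes every char in text is below 128; anything else makes the
    // array lookup throw IndexOutOfRangeException.
    public static string Convert(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
            sb.Append(Replacements[c]);
        return sb.ToString();
    }
}
```

There is no branching per character here, just one array read, which is the property the rest of this answer tries to keep for larger ranges.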

Now, this isn't practical with Unicode, which has over 109,000 assigned characters in a range going from 0 to 1114111. However, chances are the characters you care about are not only far fewer than that (and if you really did care about that many cases, you'd want the approach given just above) but also in a relatively restricted block.

Consider also if you don't especially care about any surrogates (we'll come to those later). Well, most characters you just don't care about, so, let's consider this:

char[] unchanged = new char[128];
for(int i = 0; i != 128; ++i)
  unchanged[i] = (char)i;
char[] error = new string('\uFFFD', 128).ToCharArray();
char[] block0 = (new string('\uFFFD', 19) + "---" + new string('\uFFFD', 106)).ToCharArray(); // block 64 starts at U+2000, so U+2013..U+2015 are at offsets 19..21

char[][] blocks = new char[8704][];
for(int i = 1; i != 8704; ++i)
  blocks[i] = error;
blocks[0] = unchanged;
blocks[64] = block0;

/* the above need only happen once, so it could be done with static members of a helper class that are initialised in a static constructor*/

StringBuilder sb = new StringBuilder(formattedString.Length);
foreach(char c in formattedString)
{
  int cAsI = (int)c;
  sb.Append(blocks[cAsI / 128][cAsI % 128]);
}
string ret = sb.ToString();
if(ret.IndexOf('\uFFFD') != -1)
    throw new ArgumentException("Unconvertible character");
formattedString = ret;

The balance between whether it's better to test for an unconvertible character in one go at the end (as above) or on each conversion varies according to how likely this is to happen. It's obviously even better if you can be sure (due to knowledge of your data) that it won't, and can remove that check - but you have to be really sure.

The advantage here is that while we are using a look-up method, we are only taking up 384 characters' worth of memory to hold the look-up (and some more for the array overhead) rather than 109,000 characters' worth. The best size for the blocks within this varies according to your data (that is, what replacements you want to make), but the assumption that there will be blocks that are identical to each other tends to hold.
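Packaged as a static helper with the table built once in a static constructor, the block look-up might look roughly like this. It is a sketch rather than a finished implementation: the class name is invented, it checks for an unconvertible character on each conversion instead of once at the end, and it sizes the outer array at 512 blocks, which is enough because a char is a single UTF-16 code unit (65536 / 128 = 512).

```csharp
using System;
using System.Text;

public static class BlockConverter
{
    private const int BlockSize = 128;
    private static readonly char[][] Blocks;

    static BlockConverter()
    {
        // Block 0 (U+0000..U+007F): every character maps to itself.
        var unchanged = new char[BlockSize];
        for (int i = 0; i != BlockSize; ++i)
            unchanged[i] = (char)i;

        // Shared block for ranges with no known replacement.
        var error = new string('\uFFFD', BlockSize).ToCharArray();

        // Block 64 covers U+2000..U+207F; U+2013..U+2015 are at offsets 19..21.
        var dashes = (char[])error.Clone();
        for (int cp = 0x2013; cp <= 0x2015; ++cp)
            dashes[cp % BlockSize] = '-';

        Blocks = new char[(char.MaxValue + 1) / BlockSize][];
        for (int i = 0; i != Blocks.Length; ++i)
            Blocks[i] = error;
        Blocks[0] = unchanged;
        Blocks[0x2013 / BlockSize] = dashes;
    }

    public static string Convert(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            char mapped = Blocks[c / BlockSize][c % BlockSize];
            if (mapped == '\uFFFD')
                throw new ArgumentException("Unconvertible character", nameof(text));
            sb.Append(mapped);
        }
        return sb.ToString();
    }
}
```

Adding a new replacement then means touching one block (or cloning the error block for a new range), not adding another branch to the conversion loop.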

Now, finally, what if you care about a character in the "astral planes" which are represented as surrogate pairs in the UTF-16 used internally in .NET, or if you care about replacing some multi-char strings in a particular way?

In this case, you are probably going to have to at the very least read a character or more ahead in your switch (if using the block method for most cases, you can use an unconvertible case to signal such work is required). In such a case, it might well be worth converting to and then back from US-ASCII with System.Text.Encoding and a custom implementation of EncoderFallback and EncoderFallbackBuffer, and handling it there. This means that most of the conversion (the obvious cases) will be done for you, while your implementation can deal only with the special cases.
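A rough sketch of what such a fallback could look like follows. The class names and the particular replacements ("-", "ss", "?") are assumptions chosen for illustration; only the overridden members come from the EncoderFallback and EncoderFallbackBuffer base classes.

```csharp
using System.Text;

// Hypothetical fallback: supplies an ASCII replacement string for each
// character the US-ASCII encoder cannot represent.
public sealed class AsciiReplacementFallback : EncoderFallback
{
    // Longest replacement a single fallback can produce ("ss" below).
    public override int MaxCharCount => 2;

    public override EncoderFallbackBuffer CreateFallbackBuffer() => new ReplacementBuffer();

    private sealed class ReplacementBuffer : EncoderFallbackBuffer
    {
        private string _pending = "";
        private int _position;

        public override int Remaining => _pending.Length - _position;

        // Called for a single unencodable UTF-16 code unit.
        public override bool Fallback(char charUnknown, int index)
        {
            switch (charUnknown)
            {
                case '\u2013': case '\u2014': case '\u2015':
                    _pending = "-";
                    break;
                case '\u00DF': // sharp s
                    _pending = "ss";
                    break;
                default:
                    _pending = "?";
                    break;
            }
            _position = 0;
            return true;
        }

        // Called for a surrogate pair, i.e. one character outside the BMP.
        public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
        {
            _pending = "?";
            _position = 0;
            return true;
        }

        public override char GetNextChar() =>
            _position < _pending.Length ? _pending[_position++] : '\0';

        public override bool MovePrevious()
        {
            if (_position == 0) return false;
            --_position;
            return true;
        }

        public override void Reset()
        {
            _pending = "";
            _position = 0;
        }
    }

    // Round-trips through US-ASCII so the encoder handles the obvious cases.
    public static string ToAscii(string text)
    {
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii", new AsciiReplacementFallback(), DecoderFallback.ExceptionFallback);
        return ascii.GetString(ascii.GetBytes(text));
    }
}
```

The replacement strings must themselves be pure ASCII, otherwise the encoder would need to fall back while already falling back, which it rejects.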

You could maintain a lookup table that maps your problem characters to replacement characters. For efficiency you can work on a character array to avoid the intermediate string churn that would result from using string.Replace.

For example:

var lookup = new Dictionary<char, char>
{
    { '`',  '-' },
    { 'இ', '-' },
    //next pair, etc, etc
};

var input = "blah இ blah ` blah";

char r; // "var" cannot be used without an initializer

var result = input.Select(c => lookup.TryGetValue(c, out r) ? r : c);

string output = new string(result.ToArray());

Or if you want blanket treatment of non ASCII range characters:

string output = new string(input.Select(c => c <= 127 ? c : '-').ToArray());

Unfortunately, given that you're doing a bunch of specific transforms within your data, you will likely need to do these via replacements.

That being said, you could make a few improvements.

  1. If this is common, and the strings are long, storing these in a StringBuilder instead of a string would allow in-place replacements of the values, which could potentially improve things.
  2. You could store the conversion characters, both from and to, in a Dictionary or other structure, and perform these operations in a simple loop.
  3. You could load both the "from" and "to" character at runtime from a configuration file, instead of having to hard-code every transformation operation. Later, when more of these were requested, you wouldn't need to alter your code - it could be done via configuration.
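Points 2 and 3 together could be sketched like this. The "HEX=replacement" line format, the class name and the method names are all invented for the example; a real configuration file could just as well be XML or JSON.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class ConfiguredReplacer
{
    // Parses lines such as "2013=-" (hex UTF-16 code unit, '=', replacement text).
    // Blank lines and lines starting with '#' are skipped.
    public static Dictionary<char, string> Parse(IEnumerable<string> lines)
    {
        var map = new Dictionary<char, string>();
        foreach (string line in lines)
        {
            string trimmed = line.Trim();
            if (trimmed.Length == 0 || trimmed[0] == '#')
                continue;
            int eq = line.IndexOf('=');
            if (eq <= 0)
                throw new FormatException("Expected HEX=replacement: " + line);
            char from = (char)ushort.Parse(
                line.Substring(0, eq).Trim(), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            map[from] = line.Substring(eq + 1);
        }
        return map;
    }

    // Single pass over the text, appending either the replacement or the original char.
    public static string Apply(string text, Dictionary<char, string> map)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (map.TryGetValue(c, out string replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

With the lines read from a file (for instance via File.ReadLines on a path of your choosing), a new request such as the backtick becomes one more line of configuration rather than a code change.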

If they are all replaced with the same string:

formattedString = string.Join("-", formattedString.Split('\u2013', '\u2014', '\u2015'));

or

foreach (char c in "\u2013\u2014\u2015") 
    formattedString = formattedString.Replace(c, '-');
