
Convert string to ASCII without exceptions (like TryParse)

I am implementing a TryParse() method for an ASCII string class. The method takes a string and converts it to a C-style string (i.e. a null-terminated ASCII string).

Until now I had only a Parse() method, which does the conversion to ASCII like this:

public static bool Parse(string s, out byte[] result)
{
    result = null;
    if (s == null || s.Length < 1)
        return false;

    byte[] d = new byte[s.Length + 1]; // Add space for null-terminator
    System.Text.Encoding.ASCII.GetBytes(s).CopyTo(d, 0);
    // GetBytes can throw exceptions 
    // (so can CopyTo() but I can replace that with a loop)
    result = d;
    return true;
}

However, part of the idea of a TryParse is to avoid the overhead of exceptions, and GetBytes() can throw, so I'm looking for a different method that does not.

Maybe there is a TryGetBytes()-like method?

Or maybe we can reason about the expected format of a standard .NET string and perform the conversion arithmetically (I'm not overly familiar with UTF encodings)?

EDIT: I guess for non-ASCII chars in the string, the TryParse() method should return false

EDIT: I expect when I get around to implementing the ToString() method for this class I may need to do the reverse there.

There are two possible exceptions that Encoding.GetBytes might throw, according to the documentation.

ArgumentNullException is easily avoided. Do a null check on your input and you can ensure this is never thrown.

EncoderFallbackException needs a bit more investigation... Reading the documentation:

A fallback strategy determines how an encoder handles invalid characters or how a decoder handles invalid bytes.

And if we look in the documentation for ASCII encoding, we see this:

It uses replacement fallback to replace each string that it cannot encode and each byte that it cannot decode with a question mark ("?") character.

That means it doesn't use the exception fallback and thus will never throw an EncoderFallbackException.

So in summary: if you use Encoding.ASCII and make sure you never pass in a null string, the call to GetBytes will never throw.
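To make the replacement behaviour concrete, here is a minimal sketch (the class and method names are made up for illustration). It encodes a string with the shared Encoding.ASCII instance and shows that an unencodable character silently becomes '?' (0x3F) instead of raising an exception. Note that this also means the output alone cannot tell a replaced character from a literal question mark, so it does not by itself give the "return false for non-ASCII input" behaviour asked for in the question.

```csharp
using System.Text;

public static class AsciiReplacementDemo
{
    // Encodes with the shared Encoding.ASCII instance, which uses a
    // replacement fallback: characters above U+007F come out as '?'.
    public static byte[] Encode(string s)
    {
        return Encoding.ASCII.GetBytes(s);
    }

    // True if the encoded bytes differ from the input characters,
    // i.e. at least one character was replaced.
    public static bool WasLossy(string s)
    {
        byte[] bytes = Encode(s);
        if (bytes.Length != s.Length)
        {
            return true;
        }
        for (int i = 0; i < s.Length; i++)
        {
            if (bytes[i] != s[i])
            {
                return true;
            }
        }
        return false;
    }
}
```

For example, AsciiReplacementDemo.Encode("a\u00E9") yields the two bytes 0x61 and 0x3F.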

Two options:

You could just ignore Encoding entirely, and write the loop yourself:

public static bool TryParse(string s, out byte[] result)
{
    result = null;
    // TODO: It's not clear why you don't want to be able to convert an empty string
    if (s == null || s.Length < 1)
    {
        return false;
    }

    byte[] buffer = new byte[s.Length + 1]; // Add space for null-terminator
    for (int i = 0; i < s.Length; i++)
    {
        char c = s[i];
        if (c > 127)
        {
            return false;
        }
        buffer[i] = (byte) c;
    }
    result = buffer;
    return true;
}

That's simple, but may be slightly slower than using Encoding.GetBytes.

The second option would be to use a custom EncoderFallback:

public static bool TryParse(string s, out byte[] result)
{
    result = null;
    // TODO: It's not clear why you don't want to be able to convert an empty string
    if (s == null || s.Length < 1)
    {
        return false;
    }

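    var fallback = new CustomFallback();
    // Build an ASCII encoding that uses the custom fallback for unencodable characters
    var encoding = Encoding.GetEncoding("us-ascii", fallback, DecoderFallback.ReplacementFallback);
    byte[] buffer = new byte[s.Length + 1]; // Add space for null-terminator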
    // Use overload of Encoding.GetBytes that writes straight into the buffer
    encoding.GetBytes(s, 0, s.Length, buffer, 0);
    if (fallback.HadErrors)
    {
        return false;
    }
    result = buffer;
    return true;
}

That would require writing CustomFallback though - it would need to basically keep track of whether it had ever been asked to handle invalid input.
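As a rough idea of what such a CustomFallback could look like, here is a minimal sketch. It is not taken from the answer: the nested buffer class, the EncodesCleanly helper and the choice to emit no replacement characters are all assumptions made for illustration. The fallback only records that it was invoked; the unencodable character is dropped from the output.

```csharp
using System.Text;

// Hypothetical fallback that remembers whether it was ever asked
// to handle a character the encoding could not represent.
public sealed class CustomFallback : EncoderFallback
{
    public bool HadErrors { get; private set; }

    // This fallback never produces replacement characters.
    public override int MaxCharCount
    {
        get { return 0; }
    }

    public override EncoderFallbackBuffer CreateFallbackBuffer()
    {
        return new Buffer(this);
    }

    private sealed class Buffer : EncoderFallbackBuffer
    {
        private readonly CustomFallback owner;

        public Buffer(CustomFallback owner)
        {
            this.owner = owner;
        }

        // Called for a single unencodable char.
        public override bool Fallback(char charUnknown, int index)
        {
            owner.HadErrors = true;
            return false; // nothing to substitute
        }

        // Called for an unencodable surrogate pair.
        public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
        {
            owner.HadErrors = true;
            return false;
        }

        public override char GetNextChar()
        {
            return '\0'; // no replacement characters are queued
        }

        public override bool MovePrevious()
        {
            return false;
        }

        public override int Remaining
        {
            get { return 0; }
        }
    }
}

public static class CustomFallbackDemo
{
    // Hypothetical helper: true if every char of s is encodable as ASCII.
    public static bool EncodesCleanly(string s)
    {
        var fallback = new CustomFallback();
        Encoding encoding = Encoding.GetEncoding(
            "us-ascii", fallback, DecoderFallback.ReplacementFallback);
        byte[] buffer = new byte[s.Length + 1];
        encoding.GetBytes(s, 0, s.Length, buffer, 0);
        return !fallback.HadErrors;
    }
}
```

Because HadErrors lives on the fallback object, a fresh instance should be created for each conversion, and an instance should not be shared between threads.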

If you didn't mind an encoding processing the data twice, you could call Encoding.GetByteCount with a UTF-8-based encoding with a replacement fallback (with a non-ASCII replacement character), and check whether that returns the same number of bytes as the number of chars in the string. If it does, call Encoding.ASCII.GetBytes.
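A minimal sketch of that two-pass idea follows, assuming the default Encoding.UTF8 instance is acceptable (it replaces invalid surrogates with U+FFFD, which is itself non-ASCII); the class name is made up. The check works because every non-ASCII character needs more than one UTF-8 byte per UTF-16 char, so the byte count equals the string length only when all characters are ASCII.

```csharp
using System.Text;

public static class AsciiByteCountCheck
{
    public static bool TryParse(string s, out byte[] result)
    {
        result = null;
        if (string.IsNullOrEmpty(s))
        {
            return false;
        }

        // First pass: a pure-ASCII string encodes to exactly one UTF-8 byte per char.
        if (Encoding.UTF8.GetByteCount(s) != s.Length)
        {
            return false;
        }

        // Second pass: safe to encode, nothing will be replaced.
        byte[] buffer = new byte[s.Length + 1]; // last byte stays 0 as the null terminator
        Encoding.ASCII.GetBytes(s, 0, s.Length, buffer, 0);
        result = buffer;
        return true;
    }
}
```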

Personally I'd go for the first option unless you have reason to believe it's too slow.

The GetBytes method is throwing an exception because your Encoding.EncoderFallback specifies that it should throw an exception.

Create an encoding object with EncoderReplacementFallback to avoid exceptions on unencodable characters.

Encoding encodingWithFallback = Encoding.GetEncoding(
    "us-ascii", EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
byte[] bytes = encodingWithFallback.GetBytes("Hɘ££o wor£d!"); // unencodable characters become '?'
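For contrast, the following sketch shows the case this answer is describing: an ASCII encoding configured with the exception fallback does throw EncoderFallbackException on unencodable input. The class and method names are hypothetical, and the try/catch is there only to demonstrate the behaviour, not as a recommended TryParse implementation.

```csharp
using System.Text;

public static class FallbackComparison
{
    // Returns true if a strict (exception-fallback) ASCII encoding
    // rejects the string by throwing EncoderFallbackException.
    public static bool StrictEncodingThrows(string s)
    {
        Encoding strict = Encoding.GetEncoding(
            "us-ascii", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(s);
            return false;
        }
        catch (EncoderFallbackException)
        {
            return true;
        }
    }
}
```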

This approach imitates the TryParse methods of the primitive .NET value types:

bool TryEncodingToASCII(string s, out byte[] result)
{
    if (s == null || Regex.IsMatch(s, @"[^\x00-\x7F]")) // If a single non-ASCII character is found, return false.
    {
        result = null;
        return false;
    }
    result = Encoding.ASCII.GetBytes(s); // Convert the string to ASCII bytes.
    return true;
}
