
Encoding for integer keys in Azure table

I'm going to store records in Azure tables and use partition keys and/or row keys that represent integer values.

Since partition keys and row keys must be stored as strings, I need to choose an encoding scheme that translates between strings and integers.

The keys will have a range between 0 and 2^63, but most keys will have low values (typically less than 10^6).

I'm looking for an encoding scheme with the following properties:

  • Strings must be sortable in the same order as the corresponding integers.

  • Avoid overly long strings for common (low) values.

  • The encoding must support two modes: one that generates strings for ascending sort order, and one that generates strings for descending sort order.

Ideas:

  • Simply encode keys using 16 hexadecimal characters and use an inverse alphabet to achieve descending sort order.

    While this is a simple and straightforward approach, it has the drawback that it generates overly long strings for common (low) values.

  • Use a 7-bit encoding scheme similar to that in Unicode to generate smaller strings for low values.

  • Use the fact that Azure seems to support 16-bit Unicode characters in keys. While some characters are reserved and buggy, I think it should be possible to store at least 14 significant bits per character, making it possible to represent all keys with as few as 5 characters.
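For reference, the first idea can be sketched in a few lines of C#. This is only an illustration: the class name HexKey and its method names are made up here, and complementing the value is used as an equivalent of mapping each hex digit through an inverse alphabet (digit d becomes F - d).

```csharp
using System;

// Hypothetical helper illustrating the fixed-width hexadecimal idea.
public static class HexKey
{
    // Ascending mode: 16 zero-padded uppercase hex digits, so ordinal
    // string order matches numeric order.
    public static string EncodeAscending(ulong value)
    {
        return value.ToString("X16");
    }

    // Descending mode: encode the bitwise complement, which inverts every
    // hex digit and therefore reverses the sort order.
    public static string EncodeDescending(ulong value)
    {
        return (~value).ToString("X16");
    }
}
```

Every key costs 16 characters regardless of its magnitude, which is exactly the drawback mentioned for this idea.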

Any suggestions?

Don't do this (see below)


I decided that it is enough for me to support 60-bit keys and created a class that packs 15 bits per character.

In this way I'm able to store all possible keys (0 to 2^60 - 1) in just four characters.

To avoid conflicts with reserved and buggy characters, I decided to use characters from the Unicode ranges 0x4000 to 0x9fff (Unified CJK Han) and 0xb000 to 0xcfff (East Asian scripts).

Examples:

Integer:    String:
0x0         "䀀䀀䀀䀀"
0x123       "䀀䀀䀀䄣"
0x1000      "䀀䀀䀀倀"
0x12345     "䀀䀀䀂捅"
0x100000000 "䀀䀄䀀䀀"
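A minimal sketch of how such a packing could look is shown here. The class itself was not posted, so the name Packed60Key, the ToChar helper and the exact split of the 15-bit range across the two Unicode ranges are assumptions, chosen to agree with the examples for 0x123 and 0x1000.

```csharp
using System;

// Hypothetical reconstruction of a 15-bits-per-character packing.
public static class Packed60Key
{
    // Map a 15-bit group onto 0x4000-0x9FFF (first 0x6000 values)
    // or 0xB000-0xCFFF (remaining 0x2000 values), preserving order.
    private static char ToChar(int bits15)
    {
        return (char)(bits15 < 0x6000 ? 0x4000 + bits15 : 0xB000 + (bits15 - 0x6000));
    }

    public static string Encode(long value, bool descending = false)
    {
        if (value < 0 || value >= (1L << 60))
        {
            throw new ArgumentOutOfRangeException(nameof(value));
        }

        if (descending)
        {
            // Flip the low 60 bits to reverse the sort order.
            value = ~value & ((1L << 60) - 1);
        }

        // Fill four characters from least to most significant group.
        var chars = new char[4];
        for (int i = 3; i >= 0; --i)
        {
            chars[i] = ToChar((int)(value & 0x7FFF));
            value >>= 15;
        }

        return new string(chars);
    }
}
```

Because both ranges are used in increasing order and the most significant group comes first, ordinal comparison of the strings matches numeric comparison of the keys.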

This encoding generates keys that:

  • Have the same sort order as the corresponding integers
  • Are short for all keys
  • Can be used for both ascending and descending mode by just flipping the bits.

Why you shouldn't use this encoding:

I was happy at first, since this encoding fulfilled all my requirements, but my lack of experience with Azure Tables became apparent when I traced API requests.

Since partition keys and row keys are built into request URIs, any encoding scheme that uses characters that must be percent-encoded is a bad encoding scheme.

And this scheme is entirely based on such characters. A typical request URI would look something like this:

http://myaccount.table.core.windows.net/MyTable(PartitionKey='',RowKey='%EC%AE%8A%E5%BC%B0%EC%92%BD%E6%B0%AB') 

As we can see, the nice four-character row key is sent as 36 characters!
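The blow-up is easy to reproduce locally. The sketch below uses Uri.EscapeDataString as a stand-in for whatever escaping the storage client performs (that substitution is an assumption), and the class name KeyUriCost is made up for the example. Each character in the chosen ranges takes three UTF-8 bytes, and each byte becomes a three-character %XX escape.

```csharp
using System;

// Hypothetical helper that measures a key's size once it is placed in a URI.
public static class KeyUriCost
{
    // Length of the key after percent-encoding.
    public static int EscapedLength(string key)
    {
        return Uri.EscapeDataString(key).Length;
    }
}
```

Keys built only from letters, digits, - and _ pass through unescaped, so their length in the URI equals their length in the table.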

This is what I ended up doing:

I decided to let keys be signed 64-bit integers so that I can use negative values to order keys in descending order.

I created an encoding scheme based on Base-64 with a few modifications:

  • Every 64-bit (8-byte) value is encoded as 12 Base-64 characters, and the last character is always the = padding character, so it is safe to trim it away.

  • I need to preserve natural (ordinal) sort order and the original Base-64 alphabet does not have this property.

    The original Base-64 alphabet is: A to Z, a to z, 0 to 9 and finally + and /.

    The URL-friendly Base-64 alphabet simply replaces + with - and / with _.

    I decided to use these characters but rearrange the alphabet so that it can be sorted by ordinal values.

    My alphabet is therefore: -, 0 to 9, A to Z, _, a to z.

  • Low absolute values are encoded with many leading A or / characters. I decided to pack these in a leading flag character as follows:

     'A': Negative value with no leading '/' characters
     'B': Negative value with 1 leading '/' character
     'C': Negative value with 2 leading '/' characters
     ...
     'K': Negative value with 10 leading '/' characters
     'Z': Zero (11 'A' characters)
     'a': Positive value with 10 leading 'A' characters
     'b': Positive value with 9 leading 'A' characters
     ...
     'j': Positive value with 1 leading 'A' character
     'k': Positive value without leading 'A' characters
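To see why the alphabet has to be rearranged, here is a small sketch that compares the standard Base-64 digits with the remapped ones for non-negative values. The class name SortableBase64 and its two methods are illustrative only; the flag character and the handling of negative values are left out.

```csharp
using System;
using System.Text;

// Hypothetical demo of the alphabet remapping (no flag character).
public static class SortableBase64
{
    private const string B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    private const string S64 = "-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz";

    // Big-endian Base-64 of the value with the trailing '=' trimmed.
    public static string Standard(long value)
    {
        byte[] data = BitConverter.GetBytes(value);
        if (BitConverter.IsLittleEndian)
        {
            Array.Reverse(data);
        }
        return Convert.ToBase64String(data).TrimEnd('=');
    }

    // Same digits, but each one is replaced by the character at the
    // same index in the ordinally sorted alphabet.
    public static string Sortable(long value)
    {
        var sb = new StringBuilder();
        foreach (char c in Standard(value))
        {
            sb.Append(S64[B64.IndexOf(c)]);
        }
        return sb.ToString();
    }
}
```

With the standard alphabet, 12 ends in 'w' and 13 ends in '0', so 13 sorts before 12 under ordinal comparison. After remapping they end in 'k' and 'o' and sort correctly.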

Examples

-9223372036854775807 = "AV----------"
         -2147483648 = "Frzzzzw"
             -100000 = "HyTKw"
              -10000 = "IqDw"
               -1020 = "J-B"
               -1000 = "J0R"
                -100 = "Jtg"
                 -19 = "Jyk"
                 -10 = "KJ"
                   0 = "Z"
                  10 = "ac"
                  19 = "b0B"
                 100 = "b5F"
                1000 = "byV"
                1020 = "bzk"
               10000 = "c8l-"
              100000 = "d0We-"
          2147483647 = "f6zzzzw"
 9223372036854775807 = "kUzzzzzzzzzw"

The code

public static class Int64Key
{
    /// <summary>
    ///     The minimum supported value.
    /// </summary>
    public const long MinValue = long.MinValue + 1;

    /// <summary>
    ///     The maximum supported value.
    /// </summary>
    public const long MaxValue = long.MaxValue;

    // Mapping tables to convert to/from original Base64 and the sortable variant
    private static readonly char[] B2S = new char[123];
    private static readonly char[] S2B = new char[123];

    static Int64Key()
    {
        // Base-64 alphabet
        const string B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

        // Alternative with natural (ordinal) sort order
        const string S64 = "-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz";

        // Populate mapping tables B2S and S2B
        for (int i = 0; i < 64; ++i)
        {
            B2S[B64[i]] = S64[i];
            S2B[S64[i]] = B64[i];
        }
    }

    /// <summary>
    ///     Encodes the specified integer key to its string representation.
    /// </summary>
    public static string Encode(long value)
    {
        // Check that value is within the supported range
        // Only "long.MinValue" is unsupported.
        if (value == long.MinValue)
        {
            throw new ArgumentOutOfRangeException();
        }

        bool neg = value < 0;
        byte[] data = BitConverter.GetBytes(neg ? ~(-value) : value);

        // Make sure data is big endian
        if (BitConverter.IsLittleEndian)
        {
            Array.Reverse(data);
        }

        // Get Base-64 representation
        char[] arr = new char[13];
        Convert.ToBase64CharArray(data, 0, 8, arr, 1);

        // Convert from Base-64 alphabet to the sortable variant
        // Also, count the number of leading omittable chars.
        char omitChar = neg ? '/' : 'A';
        int omitCount = 0;
        bool allOmittable = true;

        for (int i = 1; i < 12; ++i)
        {
            if (allOmittable)
            {
                if (arr[i] == omitChar)
                {
                    ++omitCount;
                }
                else
                {
                    allOmittable = false;
                }
            }

            arr[i] = B2S[arr[i]];
        }

        // Prepend the appropriate flag character
        string tab = neg ? "ABCDEFGHIJK" : "kjihgfedcbaZ";
        arr[omitCount] = tab[omitCount];

        // Create and return key string
        return new string(arr, omitCount, 12 - omitCount);
    }

    /// <summary>
    ///     Decodes the specified string key to its integer representation.
    /// </summary>
    public static long Decode(string str)
    {
        if (string.IsNullOrEmpty(str))
        {
            throw new ArgumentException();
        }

        // Interpret flag character. It tells us the number of omitted chars and whether
        // the value is positive or negative.
        char f = str[0];
        int numA;
        bool neg;

        if (f >= 'A' && f <= 'K')
        {
            numA = f - 'A';
            neg = true;
        }
        else if (f >= 'a' && f <= 'k')
        {
            numA = 'k' - f;
            neg = false;
        }
        else if (f == 'Z')
        {
            numA = 11;
            neg = false;
        }
        else
        {
            throw new ArgumentException();
        }

        char[] arr = new char[12];
        int pos;

        // Prepend the number of omitted chars
        char omitChar = neg ? '/' : 'A';
        for (pos = 0; pos < numA; ++pos)
        {
            arr[pos] = omitChar;
        }

        // Convert from the sortable alphabet to the original Base-64 alphabet
        for (int i = 1; i < str.Length; ++i, ++pos)
        {
            arr[pos] = S2B[Math.Min(122, (int)str[i])];
        }

        // Always append Base-64 padding character
        arr[11] = '=';

        // Parse Base-64
        byte[] data = Convert.FromBase64CharArray(arr, 0, 12);

        // Data is always in big endian, so we might need to swap back to little endian.
        if (BitConverter.IsLittleEndian)
        {
            Array.Reverse(data);
        }

        // Get value from bits
        long value = BitConverter.ToInt64(data, 0);

        // Negate it if needed
        return neg ? -~value : value;
    }
}

