I'm going to store records in Azure tables and use partition keys and/or row keys that represent integer values.
Since partition keys and row keys must be stored as strings, I need to choose an encoding scheme that translates between strings and integers.
The keys will range from 0 to 2^63, but most keys will have low values (typically less than 10^6).
I'm looking for an encoding scheme with the following properties:
Strings must be sortable in the same order as the corresponding integers.
Avoid overly long strings for common (low) values.
The encoding must support two modes: one that generates strings for ascending sort order and one that generates strings for descending sort order.
Ideas:
Simply encode keys using 16 hexadecimal characters and use an inverse alphabet to achieve descending sort order.
While this is a simple and straightforward approach, it has the drawback that it generates overly long strings for common (low) values.
Use a 7-bit encoding scheme similar to that in Unicode to generate smaller strings for low values.
Use the fact that Azure seems to support 16-bit Unicode characters in keys. While some characters are reserved or buggy, I think it should be possible to store at least 14 significant bits per character, making it possible to represent all keys with as few as 5 characters.
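The first idea (fixed-width hexadecimal with an inverse alphabet for descending order) could be sketched roughly as follows. The class name HexKey and its Encode method are hypothetical names made up for this illustration, and the sketch assumes lowercase hex digits, which compare correctly by ordinal value.

```csharp
using System;

// Hypothetical helper, not part of any Azure library.
public static class HexKey
{
    private const string Ascending = "0123456789abcdef";
    private const string Inverse = "fedcba9876543210";

    // Encodes a value as 16 hex digits. In descending mode every digit d
    // is replaced by 15 - d, so larger values sort before smaller ones.
    public static string Encode(ulong value, bool descending)
    {
        char[] digits = value.ToString("x16").ToCharArray();
        if (descending)
        {
            for (int i = 0; i < digits.Length; ++i)
            {
                digits[i] = Inverse[Ascending.IndexOf(digits[i])];
            }
        }
        return new string(digits);
    }
}
```

Every key is 16 characters long regardless of its value, which is exactly the drawback noted for this idea.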
Any suggestions?
Don't do this (see below)
I decided that it is enough for me to support 60-bit keys and created a class that packs 15 bits per character.
In this way I'm able to store all possible keys (0 to 2^60 - 1) in just four characters.
To avoid conflicts with reserved and buggy characters, I decided to use characters from the Unicode ranges 0x4000 to 0x9fff (Unified CJK Han) and 0xb000 to 0xcfff (East Asian scripts).
Examples:
Integer: String:
0x0 "䀀䀀䀀䀀"
0x123 "䀀䀀䀀䄣"
0x1000 "䀀䀀䀀倀"
0x12345 "䀀䀀䀂捅"
0x100000000 "䀀䀄䀀䀀"
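For illustration only, the packing could look roughly like the sketch below. The class name Packed15Key is hypothetical, and the way the 15-bit groups are split across the two character ranges (the first 0x6000 values in the lower range, the remaining 0x2000 in the upper range) is an assumption inferred from the ranges above, not the original class.

```csharp
using System;

// Hypothetical sketch: 4 characters x 15 bits = 60-bit keys (ascending order only).
public static class Packed15Key
{
    public static string Encode(long value)
    {
        if (value < 0 || value >= (1L << 60))
        {
            throw new ArgumentOutOfRangeException(nameof(value));
        }
        char[] chars = new char[4];
        for (int i = 3; i >= 0; --i)
        {
            int group = (int)(value & 0x7FFF); // lowest 15 bits
            value >>= 15;
            // Assumed mapping: groups below 0x6000 go to U+4000..U+9FFF,
            // the rest go to U+B000..U+CFFF, which keeps ordinal order.
            chars[i] = (char)(group < 0x6000 ? 0x4000 + group : 0xB000 + (group - 0x6000));
        }
        return new string(chars);
    }
}
```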
This encoding generates keys that:
Why you shouldn't use this encoding:
I was happy at first, since this encoding fulfilled all my requirements, but my lack of experience with Azure Tables became apparent when I traced API requests.
Since partition keys and row keys are built into request URIs, any encoding scheme that uses characters that must be percent-encoded is a bad encoding scheme.
And this scheme is entirely based on such characters. A typical request URI would look something like this:
http://myaccount.table.core.windows.net/MyTable(PartitionKey='',RowKey='%EC%AE%8A%E5%BC%B0%EC%92%BD%E6%B0%AB')
As we can see, the nice four-character row key is sent as 36 characters!
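The expansion can be reproduced with the framework's own URI escaping. This is a minimal sketch: the class name KeyEscaping is hypothetical, and it assumes the client library escapes keys the same way Uri.EscapeDataString does (UTF-8 bytes, three characters per byte).

```csharp
using System;

// Hypothetical helper that shows how a key looks once it is placed in a URI.
public static class KeyEscaping
{
    // Each character in the ranges used above needs three UTF-8 bytes,
    // and every byte becomes a "%XX" triplet.
    public static string Escape(string key)
    {
        return Uri.EscapeDataString(key);
    }
}
```

Calling KeyEscaping.Escape("\uCB8A\u5F30\uC4BD\u6C2B") should give the escaped row key shown in the request URI above.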
This is what I ended up doing:
I decided to let keys be signed 64-bit integers so that I can use negative values to order keys in descending order.
I created an encoding scheme based on Base-64 with a few modifications:
All 64-bit (8 byte) values will be encoded as 12 Base-64 characters, and the last character will always be the = padding char, so it is safe to trim that away.
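A quick way to see this is the following sketch (Base64Length and EncodeTrimmed are hypothetical names): 8 bytes are 64 bits, which fill 10 full 6-bit groups plus one partial group, so the standard encoder emits 11 significant characters and one = character.

```csharp
using System;

// Hypothetical helper for illustration.
public static class Base64Length
{
    // Encodes a 64-bit value as big-endian Base-64 without the trailing '='.
    public static string EncodeTrimmed(long value)
    {
        byte[] data = BitConverter.GetBytes(value);
        if (BitConverter.IsLittleEndian)
        {
            Array.Reverse(data);
        }
        string b64 = Convert.ToBase64String(data); // 12 characters for 8 bytes
        return b64.TrimEnd('=');                   // 11 significant characters
    }
}
```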
I need to preserve natural (ordinal) sort order and the original Base-64 alphabet does not have this property.
The original Base-64 alphabet is: A to Z, a to z, 0 to 9 and finally + and /.
The URL friendly Base-64 alphabet simply replaces + with - and / with _.
I decided to use these characters but rearrange the alphabet so that it can be sorted by ordinal values.
My alphabet is therefore: -, 0 to 9, A to Z, _, a to z.
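The difference between the two alphabets can be checked mechanically. In this sketch the class AlphabetCheck and its members are hypothetical names; the two strings are the standard Base-64 alphabet and the rearranged one described above.

```csharp
using System;

// Hypothetical helper for illustration.
public static class AlphabetCheck
{
    public const string Standard = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    public const string Sortable = "-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz";

    // True when every character has a higher code point than the one before it,
    // which means digit order and ordinal string order agree.
    public static bool IsOrdinallySorted(string alphabet)
    {
        for (int i = 1; i < alphabet.Length; ++i)
        {
            if (alphabet[i] <= alphabet[i - 1])
            {
                return false;
            }
        }
        return true;
    }
}
```

IsOrdinallySorted returns false for Standard (for example, 'z' comes before '0' in the alphabet but has a higher code point) and true for Sortable.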
Low absolute values are encoded with many leading A or / characters. I decided to pack these in a leading flag character as follows:
'A': Negative value with no leading '/' characters
'B': Negative value with 1 leading '/' character
'C': Negative value with 2 leading '/' characters
...
'K': Negative value with 10 leading '/' characters
'Z': Zero (11 'A' characters)
'a': Positive value with 10 leading 'A' characters
'b': Positive value with 9 leading 'A' characters
...
'j': Positive value with 1 leading 'A' character
'k': Positive value without leading 'A' characters
Examples
-9223372036854775807 = "AV----------"
-2147483648 = "Frzzzzw"
-100000 = "HyTKw"
-10000 = "IqDw"
-1020 = "J-B"
-1000 = "J0R"
-100 = "Jtg"
-19 = "Jyk"
-10 = "KJ"
0 = "Z"
10 = "ac"
19 = "b0B"
100 = "b5F"
1000 = "byV"
1020 = "bzk"
10000 = "c8l-"
100000 = "d0We-"
2147483647 = "f6zzzzw"
9223372036854775807 = "kUzzzzzzzzzw"
The code
using System;

public static class Int64Key
{
    /// <summary>
    /// The minimum supported value.
    /// </summary>
    public const long MinValue = long.MinValue + 1;

    /// <summary>
    /// The maximum supported value.
    /// </summary>
    public const long MaxValue = long.MaxValue;

    // Mapping tables to convert to/from original Base64 and the sortable variant
    private static readonly char[] B2S = new char[123];
    private static readonly char[] S2B = new char[123];

    static Int64Key()
    {
        // Base-64 alphabet
        const string B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
        // Alternative with natural (ordinal) sort order
        const string S64 = "-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz";

        // Populate mapping tables B2S and S2B
        for (int i = 0; i < 64; ++i)
        {
            B2S[B64[i]] = S64[i];
            S2B[S64[i]] = B64[i];
        }
    }

    /// <summary>
    /// Encodes the specified integer key to its string representation.
    /// </summary>
    public static string Encode(long value)
    {
        // Check that value is within the supported range
        // Only "long.MinValue" is unsupported.
        if (value == long.MinValue)
        {
            throw new ArgumentOutOfRangeException(nameof(value));
        }

        bool neg = value < 0;
        // Negative values are stored as (value - 1), which is what ~(-value) computes
        byte[] data = BitConverter.GetBytes(neg ? ~(-value) : value);

        // Make sure data is big endian
        if (BitConverter.IsLittleEndian)
        {
            Array.Reverse(data);
        }

        // Get Base-64 representation (index 0 is reserved for the flag character)
        char[] arr = new char[13];
        Convert.ToBase64CharArray(data, 0, 8, arr, 1);

        // Convert from Base-64 alphabet to the sortable variant
        // Also, count the number of leading omittable chars.
        char omitChar = neg ? '/' : 'A';
        int omitCount = 0;
        bool allOmittable = true;
        for (int i = 1; i < 12; ++i)
        {
            if (allOmittable)
            {
                if (arr[i] == omitChar)
                {
                    ++omitCount;
                }
                else
                {
                    allOmittable = false;
                }
            }
            arr[i] = B2S[arr[i]];
        }

        // Prepend the appropriate flag character
        string tab = neg ? "ABCDEFGHIJK" : "kjihgfedcbaZ";
        arr[omitCount] = tab[omitCount];

        // Create and return key string (the trailing '=' at index 12 is dropped)
        return new string(arr, omitCount, 12 - omitCount);
    }

    /// <summary>
    /// Decodes the specified string key to its integer representation.
    /// </summary>
    public static long Decode(string str)
    {
        if (string.IsNullOrEmpty(str))
        {
            throw new ArgumentException("Key must not be null or empty.", nameof(str));
        }

        // Interpret flag character. It tells us the number of omitted chars and whether
        // the value is positive or negative.
        char f = str[0];
        int numA;
        bool neg;
        if (f >= 'A' && f <= 'K')
        {
            numA = f - 'A';
            neg = true;
        }
        else if (f >= 'a' && f <= 'k')
        {
            numA = 'k' - f;
            neg = false;
        }
        else if (f == 'Z')
        {
            numA = 11;
            neg = false;
        }
        else
        {
            throw new ArgumentException("Invalid flag character.", nameof(str));
        }

        // A valid key is the flag character plus the 11 - numA characters that were
        // not omitted. Checking this also prevents writing past the end of the array.
        if (str.Length != 12 - numA)
        {
            throw new ArgumentException("Invalid key length.", nameof(str));
        }

        char[] arr = new char[12];
        int pos;

        // Prepend the number of omitted chars
        char omitChar = neg ? '/' : 'A';
        for (pos = 0; pos < numA; ++pos)
        {
            arr[pos] = omitChar;
        }

        // Convert from the sortable alphabet to the original Base-64 alphabet
        for (int i = 1; i < str.Length; ++i, ++pos)
        {
            arr[pos] = S2B[Math.Min(122, (int)str[i])];
        }

        // Always append Base-64 padding character
        arr[11] = '=';

        // Parse Base-64
        byte[] data = Convert.FromBase64CharArray(arr, 0, 12);

        // Data is always in big endian, so we might need to swap back to little endian.
        if (BitConverter.IsLittleEndian)
        {
            Array.Reverse(data);
        }

        // Get value from bits
        long value = BitConverter.ToInt64(data, 0);

        // Negate it if needed (undoes the ~(-value) transform used by Encode)
        return neg ? -~value : value;
    }
}