How to encode binary data as any arbitrary text representation?

I need a pair of functions for encoding binary data as any arbitrary text representation, and for decoding it back.

Say we have an ArrayBuffer of any size:

const buffer = new ArrayBuffer(1000)

Then we define a hexadecimal "lingo", and use it for encoding and decoding hex strings:

const lingo = "0123456789abcdef"

const text = encode(buffer, lingo)
const data = decode(text, lingo)

My goal is to define my own base48 "lingo", which omits vowels to avoid naughty words:

const lingo = "256789bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ"

const text = encode(buffer, lingo)
const data = decode(text, lingo)

How can we approach creating the algorithms for efficiently transforming data between arbitrary representations? Even though this strikes me as something quite fundamental, I'm having a hard time finding resources to help me with this task.

Bonus points if you can think of any plausible naughty words without any vowels. I even took out the numbers that look like vowels!

I'm working in JavaScript, but I'd also like to understand the principles in general. Thanks!

The challenge with streaming a series of bytes/digits and converting them to another base is finding the most efficient ratio of source bytes/digits to target bytes/digits.
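As a rough sanity check on what "most efficient" can mean here: n source digits in base s can take s^n distinct values, so the target needs at least n * log(s) / log(t) digits in base t, and the ratio of source digits to target digits can never exceed log(t) / log(s). A minimal sketch of that ceiling follows; efficiencyBound is a hypothetical helper name and is not part of the code in this answer.

```javascript
// Upper bound on (source digits) / (target digits) for any chunk size.
// Hypothetical helper for illustration only.
function efficiencyBound(sourceBase, targetBase) {
  return Math.log(targetBase) / Math.log(sourceBase);
}

console.log(efficiencyBound(256, 10)); // about 0.415
console.log(efficiencyBound(256, 48)); // about 0.698
```

Any chunk size picked by a search can only approach this bound, never beat it, which gives a quick way to judge how much a larger chunk size could still gain.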

To determine the best ratio, the algorithm below contains a function dubbed mostEfficientChunk() which takes as parameters the source number base, the target number base, and the maximum source chunk size. This function walks the source chunk sizes from 1 to the maximum chunk size, and for each one determines the minimum number of digits the target number base needs to hold any value of that chunk. E.g., with a Uint8Array source, 1 byte ranges from 0 - 255 and requires 3 digits when converting to base 10. In this example, the efficiency is measured at 1/3 or 33.33%. Then a source of 2 bytes is examined, which has a range of 0 - 65535 and requires 5 digits of base 10, for an efficiency of 2/5 or 40%. So a source chunk size of 2 bytes when converting from base 256 to base 10 is more efficient than a chunk size of 1 byte. And so on, until the best ratio is found among the chunk sizes up to the maximum source chunk size.
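The walk described above can be sketched on its own, separate from the class. This is only an illustration of the idea: digitsNeeded and chunkTable are hypothetical names introduced here, and the sketch assumes BigInt is available.

```javascript
// Number of base-`targetBase` digits needed to hold the largest value that
// `chunkSize` digits of base `sourceBase` can represent.
function digitsNeeded(sourceBase, chunkSize, targetBase) {
  let max = BigInt(sourceBase) ** BigInt(chunkSize) - 1n;
  let digits = 0;
  while (max > 0n) {
    max /= BigInt(targetBase); // BigInt division truncates
    digits++;
  }
  return digits;
}

// One row per candidate source chunk size, with its efficiency ratio.
function chunkTable(sourceBase, targetBase, maxChunkSize) {
  const rows = [];
  for (let n = 1; n <= maxChunkSize; n++) {
    const d = digitsNeeded(sourceBase, n, targetBase);
    rows.push({ sourceDigits: n, targetDigits: d, efficiency: n / d });
  }
  return rows;
}

console.table(chunkTable(256, 10, 5));
```

For base 256 to base 10 the first two rows match the figures in the text: 1 byte needs 3 digits and 2 bytes need 5 digits.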

The code below dumps the evaluation of mostEfficientChunk() to make the determination of the best chunk size readily apparent.

Then, once the chunk size is set, the source data is fed to code(), which queues up the source and, if sufficient data exists to form a chunk, converts the chunk to the target base. Note that code() can be called repeatedly if the source is streaming. When the stream is finished, flush() must be called; it appends zero digits until a full chunk is available, and then produces the final target chunk. Note that this last chunk is padded, so one will have to track the length of the original source to trim the decoded output appropriately.
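To make the padding and trimming concrete, here is a minimal one-shot sketch, not the streaming class from this answer. The names encodeBytes and decodeText are hypothetical, the chunk sizes (2 bytes to 3 base-48 digits) are picked by hand, and the caller is assumed to carry the original length alongside the text.

```javascript
// Encode bytes in fixed chunks; bytes past the end of the input count as 0.
function encodeBytes(bytes, lingo, srcChunk, tgtChunk) {
  const base = BigInt(lingo.length);
  let text = '';
  for (let i = 0; i < bytes.length; i += srcChunk) {
    let value = 0n;
    for (let j = 0; j < srcChunk; j++) {
      value = value * 256n + BigInt(i + j < bytes.length ? bytes[i + j] : 0);
    }
    let chunk = '';
    for (let k = 0; k < tgtChunk; k++) {
      chunk = lingo[Number(value % base)] + chunk; // least significant digit first
      value /= base;
    }
    text += chunk;
  }
  return text;
}

// Decode fixed chunks, then trim the zero padding using the known length.
function decodeText(text, lingo, srcChunk, tgtChunk, originalLength) {
  const base = BigInt(lingo.length);
  const bytes = [];
  for (let i = 0; i < text.length; i += tgtChunk) {
    let value = 0n;
    for (const ch of text.slice(i, i + tgtChunk)) {
      const digit = lingo.indexOf(ch);
      if (digit < 0) throw new Error(`Invalid character: ${ch}`);
      value = value * base + BigInt(digit);
    }
    const chunk = [];
    for (let k = 0; k < srcChunk; k++) {
      chunk.unshift(Number(value % 256n));
      value /= 256n;
    }
    bytes.push(...chunk);
  }
  return Uint8Array.from(bytes.slice(0, originalLength));
}

const lingo = '256789bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ';
const input = Uint8Array.from([255, 254, 253, 252, 251]);
const text = encodeBytes(input, lingo, 2, 3);
const output = decodeText(text, lingo, 2, 3, input.length);
console.log(text, Array.from(output));
```

Passing the length out of band is only one option; a length prefix in the encoded text would work too, at the cost of a few extra characters.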

There are some comments and test cases in the code to help in understanding how the EncodeStream class operates.

class EncodeStream {

  constructor( fromBase, toBase, encode = 'encode', maxChunkSize = 32 ) {
    console.assert( typeof fromBase === 'string' || typeof fromBase === 'number' );
    console.assert( typeof toBase === 'string' || typeof toBase === 'number' );
    console.assert( encode === 'encode' || encode === 'decode' );

    this.encode = encode;

    // A base given as a string supplies its own digit alphabet; a numeric
    // base means the digits are plain numbers.
    if ( typeof fromBase === 'string' ) {
      this.fromBase = fromBase.length;
      this.fromBaseDigits = fromBase;
    } else {
      // Math.trunc rather than "| 0", which wraps for bases of 2**31 and up.
      this.fromBase = Math.trunc( fromBase );
      this.fromBaseDigits = null;
    }
    console.assert( 2 <= this.fromBase && this.fromBase <= 2**32 );

    if ( typeof toBase === 'string' ) {
      this.toBase = toBase.length;
      this.toBaseDigits = toBase;
    } else {
      this.toBase = Math.trunc( toBase );
      this.toBaseDigits = null;
    }
    console.assert( 2 <= this.toBase && this.toBase <= 2**32 );

    if ( encode === 'encode' ) {
      this.chunking = this.mostEfficientChunk( this.fromBase, this.toBase, maxChunkSize );
    } else {
      // When decoding, use the encoder's chunking with source and target swapped.
      let temp = this.mostEfficientChunk( this.toBase, this.fromBase, maxChunkSize );
      this.chunking = {
        bestSrcChunk: temp.bestTgtChunk,
        bestTgtChunk: temp.bestSrcChunk
      };
    }

    console.log( `Best Source Chunk Size: ${this.chunking.bestSrcChunk}, Best Target Chunk Size: ${this.chunking.bestTgtChunk}` );

    // Pending source digits, always stored as numeric values.
    this.streamQueue = [];
  }

  code( stream ) {
    console.assert( typeof stream === 'string' || Array.isArray( stream ) );

    if ( this.fromBaseDigits ) {
      this.streamQueue.push( ...stream.split( '' ).map( digit => this.fromBaseDigits.indexOf( digit ) ) );
    } else {
      this.streamQueue.push( ...stream );
    }

    let result = [];
    while ( this.chunking.bestSrcChunk <= this.streamQueue.length ) {

      // Convert the source chunk to a BigInt value.
      let chunk = this.streamQueue.splice( 0, this.chunking.bestSrcChunk );
      let chunkValue = 0n;
      for ( let i = 0; i < chunk.length; i++ ) {
        chunkValue = chunkValue * BigInt( this.fromBase ) + BigInt( chunk[ i ] );
      }

      // And now convert the BigInt value to a target chunk, filling the
      // digits from least significant to most significant.
      let temp = new Array( this.chunking.bestTgtChunk );
      for ( let i = 0; i < this.chunking.bestTgtChunk; i++ ) {
        temp[ this.chunking.bestTgtChunk - 1 - i ] = Number( chunkValue % BigInt( this.toBase ) );
        chunkValue = chunkValue / BigInt( this.toBase );
      }
      result.push( ...temp );
    }

    // Finally, if the target base is represented by a string of digits, then map
    // the resulting array to the target digits.
    if ( this.toBaseDigits ) {
      result = result.map( digit => this.toBaseDigits[ digit ] ).join( '' );
    }

    return result;
  }

  flush() {
    // Simply add zero digits to the stream until we have a complete chunk.
    // The queue holds numeric digit values, so the padding is always 0,
    // whether or not the source base has a digit alphabet.
    if ( 0 < this.streamQueue.length ) {
      while ( this.streamQueue.length < this.chunking.bestSrcChunk ) {
        this.streamQueue.push( 0 );
      }
    }
    return this.code( this.fromBaseDigits ? '' : [] );
  }

  mostEfficientChunk( sourceBase, targetBase, maxChunkSize ) {
    console.assert( 2 <= sourceBase && sourceBase <= 2 ** 32 );
    console.assert( 2 <= targetBase && targetBase <= 2 ** 32 );
    console.assert( 1 <= maxChunkSize && maxChunkSize <= 64 );

    // Since BigInt does not have a LOG function, let's just brute force
    // determine the maximum number of target digits per chunk size of
    // source digits...
    let sBase = BigInt( sourceBase );
    let tBase = BigInt( targetBase );
    let mSize = BigInt( maxChunkSize );
    let efficiency = 0;
    let result = {
      bestSrcChunk: 0,
      bestTgtChunk: 0
    };
    for ( let chunkSize = 1n; chunkSize <= mSize; chunkSize++ ) {
      let maxSrcValue = sBase ** chunkSize - 1n;

      // Count the target digits needed to hold the largest source value.
      let d = 0n;
      let msv = maxSrcValue;
      while ( 0n < msv ) {
        msv = msv / tBase;
        d++;
      }

      if ( this.encode === 'encode' ) {
        console.log( `Source Chunk Size: ${chunkSize}, Max Source Value: ${maxSrcValue}\nTarget Chunk Size: ${d}, Max Target Value: ${tBase**d-1n}, Efficiency: ${Number( chunkSize * 10000n / d ) / 100}%` );
      }

      if ( efficiency < Number( chunkSize ) / Number( d ) ) {
        efficiency = Number( chunkSize ) / Number( d );
        result.bestSrcChunk = Number( chunkSize );
        result.bestTgtChunk = Number( d );
      }
    }
    return result;
  }
}

// Test helper: encode `source` (an array of bytes) to `toBase`, then decode
// it again. The decoded output still includes any zero padding.
function roundTrip( source, toBase, maxChunkSize ) {
  console.log( '\n\n' );
  console.log( [ 'Encoding', source.join( ',' ), `to base '${toBase}'` ] );
  const encoder = new EncodeStream( 256, toBase, 'encode', maxChunkSize );
  let encoderResult = '';
  encoderResult += encoder.code( source );
  encoderResult += encoder.flush();
  console.log( `Encoded result: '${encoderResult}'` );

  console.log( [ 'Decoding...' ] );
  const decoder = new EncodeStream( toBase, 256, 'decode', maxChunkSize );
  let decoderResult = '';
  decoderResult += decoder.code( encoderResult );
  decoderResult += decoder.flush();
  console.log( `Decoded result: '${decoderResult}'` );
}

roundTrip( [255,254,253,252,251], '0123456789', 2 );
roundTrip( [255,254,253,252,251], '0123456789', 16 );
roundTrip( [255,254,253,252,251,250,249,248,247], '256789bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ', 16 );

Note that it appears you will need to open up the browser debugger to see the full console log results.
