Compress a series of 1s and 0s into the shortest possible ascii string

How could you convert a series of 1s and 0s into the shortest possible form consisting of URL-safe ASCII characters?

For example:

s = '00100101000101111010101'
compress(s)

Resulting in something like:

Ysi8aaU

And obviously:

decompress(compress(s)) == s

(I ask this question purely out of curiosity.)

Here's the solution I came up with (+ far too many comments):

# A set of 64 characters, which allows a maximum chunk length of 6 .. because
# int('111111', 2) == 63 (plus zero)
charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_'

def encode(bin_string):
    # Split the string of 1s and 0s into lengths of 6.
    chunks = [bin_string[i:i+6] for i in range(0, len(bin_string), 6)]
    # Store the length of the last chunk so that we can add that as the last bit
    # of data so that we know how much to pad the last chunk when decoding.
    last_chunk_length = len(chunks[-1])
    # Convert each chunk from binary into a decimal
    decimals = [int(chunk, 2) for chunk in chunks]
    # Add the length of our last chunk to our list of decimals.
    decimals.append(last_chunk_length)
    # Produce an ascii string by using each decimal as an index of our charset.
    ascii_string = ''.join([charset[i] for i in decimals])

    return ascii_string

def decode(ascii_string):
    # Convert each character to a decimal using its index in the charset.
    decimals = [charset.index(char) for char in ascii_string]
    # Take last decimal which is the final chunk length, and the second to last
    # decimal which is the final chunk, and keep them for later to be padded
    # appropriately and appended.
    last_chunk_length, last_decimal = decimals.pop(-1), decimals.pop(-1)
    # Take each decimal, convert it to a binary string (removing the 0b from the
    # beginning), and pad it to 6 digits long.
    bin_string = ''.join([bin(decimal)[2:].zfill(6) for decimal in decimals])
    # Add the last decimal converted to binary padded to the appropriate length
    bin_string += bin(last_decimal)[2:].zfill(last_chunk_length)

    return bin_string

So:

>>> bin_string = '000111000010101010101000101001110'
>>> encode(bin_string)
'hcQOPgd'
>>> decode(encode(bin_string))
'000111000010101010101000101001110'

And here it is in CoffeeScript:

class Urlify
    constructor: ->
        @charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_'

    encode: (bits) ->
        chunks = (bits[i...i+6] for i in [0...bits.length] by 6)
        last_chunk_length = chunks[chunks.length-1].length
        decimals = (parseInt(chunk, 2) for chunk in chunks)
        decimals.push(last_chunk_length)
        encoded = (@charset[i] for i in decimals).join('')

        return encoded

    decode: (encoded) ->
        decimals = (@charset.indexOf(char) for char in encoded)
        [last_chunk_length, last_decimal] = [decimals.pop(), decimals.pop()]
        decoded = (('00000'+d.toString(2)).slice(-6) for d in decimals).join('')
        last_chunk = ('00000'+last_decimal.toString(2)).slice(-last_chunk_length)
        decoded += last_chunk

        return decoded

As one of the comments mentioned, using base64 would probably be the way to go. However, you don't want to feed it the string of 1s and 0s without converting it first.
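To illustrate why, here is a small check (a sketch only, reusing the example string from the question). Base64-encoding the text of the bit string itself makes it longer, because every '0' or '1' character occupies a whole byte:

```python
import base64

s = '00100101000101111010101'  # 23 bits, taken from the question

# Encoding the characters directly treats every '0'/'1' as a full byte.
naive = base64.urlsafe_b64encode(s.encode('ascii')).rstrip(b'=').decode('ascii')

# The "compressed" form is longer than the input.
print(len(s), len(naive))
```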

There are two options. The first is to convert to an int and then encode its decimal representation:

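import base64

s = '0110110'
# int() discards leading zeros, so the length of s is not preserved.
n = int(s, 2)

# urlsafe_b64encode needs bytes in Python 3, so encode the decimal digits first.
result = base64.urlsafe_b64encode(str(n).encode('ascii')).rstrip(b'=').decode('ascii')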

The other option would be to use the struct module to pack the value into a binary format and encode that. (The code below is from http://www.fuyun.org/2009/10/how-to-convert-an-integer-to-base64-in-python/ )

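import base64
import struct

def encode(n):
  # Pack as a little-endian unsigned 64-bit int and drop trailing zero bytes.
  data = struct.pack('<Q', n).rstrip(b'\x00')
  if len(data)==0:
    data = b'\x00'
  s = base64.urlsafe_b64encode(data).rstrip(b'=')
  return s.decode('ascii')

def decode(s):
  # Restore the stripped '=' padding before decoding.
  data = base64.urlsafe_b64decode(s + '=' * (-len(s) % 4))
  n = struct.unpack('<Q', data + b'\x00' * (8-len(data)))
  return n[0]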
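Note that going through an integer drops any leading zeros in the bit string, and the '<Q' format limits the value to 64 bits. One possible workaround, sketched here as an assumption rather than as part of the code above, is to prepend a sentinel '1' bit before converting and to use int.to_bytes so the length is not capped. The names bits_to_token and token_to_bits are hypothetical:

```python
import base64

def bits_to_token(bits):
    # Prepend a sentinel '1' so leading zeros survive the int conversion.
    n = int('1' + bits, 2)
    data = n.to_bytes((n.bit_length() + 7) // 8, 'big')
    return base64.urlsafe_b64encode(data).rstrip(b'=').decode('ascii')

def token_to_bits(token):
    # Restore the '=' padding that was stripped during encoding.
    data = base64.urlsafe_b64decode(token + '=' * (-len(token) % 4))
    n = int.from_bytes(data, 'big')
    # bin() gives '0b1...'; drop the '0b' prefix and the sentinel bit.
    return bin(n)[3:]

s = '00100101000101111010101'
token = bits_to_token(s)
print(token, token_to_bits(token) == s)
```

The sentinel costs one extra bit, which only adds a character when it pushes the data across a byte boundary.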

I would use a lookup table to convert each group of 8 of these 0s and 1s into a byte, and then encode those bytes with base64.
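The lookup-table idea could be sketched roughly as follows. This is only an illustration: the names BYTE_TABLE, pack_bits and unpack_bits are hypothetical, and storing the pad count in a leading byte is an assumption of this sketch (the answer does not say how to handle lengths that are not a multiple of 8).

```python
import base64

# Lookup table mapping every 8-character bit pattern to its byte value.
BYTE_TABLE = {format(i, '08b'): i for i in range(256)}

def pack_bits(bits):
    # Pad on the right to a multiple of 8 and remember how many bits were added.
    pad = -len(bits) % 8
    padded = bits + '0' * pad
    body = bytes(BYTE_TABLE[padded[i:i + 8]] for i in range(0, len(padded), 8))
    # Assumption of this sketch: keep the pad count in a leading byte.
    data = bytes([pad]) + body
    return base64.urlsafe_b64encode(data).rstrip(b'=').decode('ascii')

def unpack_bits(token):
    # Restore the '=' padding that was stripped during encoding.
    data = base64.urlsafe_b64decode(token + '=' * (-len(token) % 4))
    pad, body = data[0], data[1:]
    bits = ''.join(format(b, '08b') for b in body)
    # Drop the padding bits that pack_bits appended.
    return bits[:len(bits) - pad]

s = '00100101000101111010101'
token = pack_bits(s)
print(token, unpack_bits(token) == s)
```

A whole byte for the pad count is wasteful for short inputs; it is used here only to keep the sketch simple.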
