

Using Python, how can I compress a long query string value?

I am generating a URL in Python for a GET request (it has to be a GET request), and one of my query string parameters is extremely long (~900 chars). Is there any way I can compress this string and place it in the URL? I have tried zlib, but that works on bytes and the URL needs to be a string. Basically, is there any way to do this?

# On server
x = '900_char_string'
compressed_string = compress(x)
return 'http://whatever?querystring_var=' + compressed_string
# ^ return value is what client requests by clicking link with that url or whatever
# On client
# GET http://whatever?querystring_var=randomcompressedchars<900
# Server receiving request
value = request['querystring_var']
y = decompress(value)
print(y)
>>> 900_char_string  # at this point server can work with the uncompressed string


The issue is now fairly clear. I think we need to examine this from the standpoint of information theory.

  • The input is a string of visible characters, currently represented in 8 bits each.
  • The "alphabet" for this string is alphanumeric (26+26+10 symbols), plus about 20 special and reserved characters, 80+ characters total.
  • There is no apparent redundancy in the generated string.

There are three main avenues to shortening a representation, each taking advantage of a different property:

  • Frequency of characters (Huffman coding): replace a frequent character with fewer than 8 bits; longer bit strings will then be needed for rare characters.
  • Frequency of substrings (compression): replace a frequent substring with a single character.
  • Convert to a different base: ideally, len(alphabet).

The first two methods can lengthen the resulting string, as they require starting with a translation table. Also, since your strings appear to be taken from a uniform random distribution, there will be no redundancy or commonality to leverage. When the Shannon entropy is at or near the maximum over the input tokens, there is nothing to be gained from those methods.
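To see the effect described above, here is a minimal sketch that runs zlib over two 900-character strings: one drawn uniformly at random from an assumed alphabet of about 80 URL characters, and one that is highly repetitive. The alphabet, the fixed seed, the sample strings, and the helper name `url_token` are all assumptions made for illustration. Because the query string has to be text, the compressed bytes are wrapped in URL-safe Base64.

```python
import base64
import random
import string
import zlib

# Assumed alphabet: letters, digits and 19 special/reserved URL characters.
ALPHABET = string.ascii_letters + string.digits + ":/?#[]()@!$%&'*+,;="


def url_token(text: str) -> str:
    """Compress text with zlib, then wrap the bytes in URL-safe Base64."""
    compressed = zlib.compress(text.encode("ascii"), 9)
    return base64.urlsafe_b64encode(compressed).decode("ascii")


def from_url_token(token: str) -> str:
    """Reverse url_token."""
    return zlib.decompress(base64.urlsafe_b64decode(token)).decode("ascii")


rng = random.Random(0)  # fixed seed so the run is repeatable
random_text = "".join(rng.choice(ALPHABET) for _ in range(900))
redundant_text = "status=ok&" * 90  # also 900 characters, but repetitive

random_token = url_token(random_text)
redundant_token = url_token(redundant_text)

print("random   :", len(random_text), "->", len(random_token))
print("redundant:", len(redundant_text), "->", len(redundant_token))
```

The repetitive string shrinks dramatically, while the random one comes out longer than it went in: zlib finds nothing to exploit, and Base64 then adds roughly a third on top.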

This leaves us with base conversion. We're using 8 bits (256 combinations) to represent an alphabet of only 82 characters. A simple base conversion will save about 20%; the ratio is log(82) / log(256). If you want a cheap conversion, simply map into a 7-bit representation, a saving of 12.5%.
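The arithmetic behind those two figures can be checked with a few lines. This is only a worked calculation; the 900-character length comes from the question, and the variable names are made up for the example.

```python
import math

n_chars = 900        # length of the query string value in the question
alphabet_size = 82   # alphabet size assumed in the text above

# Each symbol carries log2(82) bits of information instead of the 8 bits it occupies.
ratio = math.log(alphabet_size) / math.log(256)

base_bytes = math.ceil(n_chars * ratio)        # full base conversion
seven_bit_bytes = math.ceil(n_chars * 7 / 8)   # cheap 7-bit packing

print(f"ratio            = {ratio:.3f}")       # 0.795, i.e. about 20% saved
print(f"base conversion  = {base_bytes} bytes")      # 716
print(f"7-bit packing    = {seven_bit_bytes} bytes")  # 788
```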

Very simply, define a symbol ordering on your character set, such as:

0123456789ABCDEFGH...YZabcd...yz:/?#[]()@!$%&'*+,;=   (81 chars)

Now, compute the numerical equivalent of a given string, just as if you were hand-coding a conversion from a decimal or hex string. The resulting large integer is the compressed value. Write it out in bytes, or chop it into 32-bit integers, or whatever fits your intermediate storage medium.
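The conversion described above could be sketched as follows. The exact alphabet is an assumption: the elided ranges are taken to be the full A-Z and a-z, followed by 19 special characters, for 81 symbols. The function names `pack` and `unpack` are hypothetical. Leading "0" symbols vanish when the string becomes an integer, so this sketch passes the original length to the decoder.

```python
import string

# Assumed symbol ordering: digits, upper case, lower case, then 19 specials (81 total).
ALPHABET = (string.digits + string.ascii_uppercase + string.ascii_lowercase
            + ":/?#[]()@!$%&'*+,;=")
BASE = len(ALPHABET)
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def pack(text: str) -> bytes:
    """Read text as a base-81 number and write that number out as bytes."""
    n = 0
    for ch in text:
        n = n * BASE + INDEX[ch]  # raises KeyError for symbols outside ALPHABET
    return n.to_bytes((n.bit_length() + 7) // 8, "big")


def unpack(data: bytes, length: int) -> str:
    """Rebuild the original text; length restores any leading '0' symbols."""
    n = int.from_bytes(data, "big")
    chars = []
    for _ in range(length):
        n, digit = divmod(n, BASE)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


sample = "00abcXYZ:/?#[]&=" * 56 + "end!"   # 900 characters
packed = pack(sample)
print(len(sample), "chars ->", len(packed), "bytes")
assert unpack(packed, len(sample)) == sample
```

Python's arbitrary-precision integers make this a few lines, and a 900-character input packs into at most 714 bytes. One caveat if the result goes back into a URL: raw bytes need a text-safe encoding such as `base64.urlsafe_b64encode`, which grows them by about a third and, for an alphabet this large, likely cancels the saving.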

