Python: Reversibly encode alphanumeric string to integer

I want to convert a string (composed of alphanumeric characters) into an integer, and then convert that integer back into the original string:

string --> int --> string

In other words, I want to represent an alphanumeric string by an integer.

I found a working solution, which I included in the answer, but I do not think it is the best solution, and I am interested in other ideas and methods.

Please don't mark this as a duplicate just because many similar questions already exist. I specifically want an easy way of transforming a string into an integer and vice versa.

This should work for strings that contain alphanumeric characters, i.e. strings containing digits and letters.

Here's what I have so far:

string --> bytes

mBytes = m.encode("utf-8")

bytes --> int

mInt = int.from_bytes(mBytes, byteorder="big")

int --> bytes

mBytes = mInt.to_bytes(((mInt.bit_length() + 7) // 8), byteorder="big")

bytes --> string

m = mBytes.decode("utf-8")

Try it out:

m = "test123"
mBytes = m.encode("utf-8")
mInt = int.from_bytes(mBytes, byteorder="big")
mBytes2 = mInt.to_bytes(((mInt.bit_length() + 7) // 8), byteorder="big")
m2 = mBytes2.decode("utf-8")
print(m == m2)
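
The expression (mInt.bit_length() + 7) // 8 is a ceiling division: it gives the smallest number of bytes that can hold the integer. A minimal sketch of that arithmetic follows; the helper name min_byte_length is only illustrative and does not appear in the question.

```python
def min_byte_length(n: int) -> int:
    """Smallest number of bytes that can hold the non-negative integer n."""
    # Adding 7 before the floor division rounds the bit count up to whole bytes.
    return (n.bit_length() + 7) // 8


for n in (0, 1, 255, 256, 65535, 65536):
    print(n, n.bit_length(), min_byte_length(n))
# 0 0 0
# 1 1 1
# 255 8 1
# 256 9 2
# 65535 16 2
# 65536 17 3
```

Zero needs no bytes at all, which is why an empty string round-trips to an empty string.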

Here is an equivalent, reusable version of the above:

class BytesIntEncoder:

    @staticmethod
    def encode(b: bytes) -> int:
        return int.from_bytes(b, byteorder='big')

    @staticmethod
    def decode(i: int) -> bytes:
        return i.to_bytes(((i.bit_length() + 7) // 8), byteorder='big')

The type annotations are optional and may be removed; function annotations are valid syntax in every Python 3 release.

Test:

>>> s = 'Test123'
>>> b = s.encode()
>>> b
b'Test123'

>>> BytesIntEncoder.encode(b)
23755444588720691
>>> BytesIntEncoder.decode(_)
b'Test123'
>>> _.decode()
'Test123'

Recall that a string can be encoded to bytes, which can then be encoded to an integer. Reversing the two encodings yields the bytes first and then the original string.
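
Putting both steps together, the full string --> int --> string round trip might be wrapped as below. This is only a sketch: the function names str_to_int and int_to_str are made up for illustration, and UTF-8 is assumed as the text encoding.

```python
def str_to_int(s: str) -> int:
    # str -> bytes -> int
    return int.from_bytes(s.encode("utf-8"), byteorder="big")


def int_to_str(i: int) -> str:
    # int -> bytes -> str; the byte count is the bit length rounded up to whole bytes.
    return i.to_bytes((i.bit_length() + 7) // 8, byteorder="big").decode("utf-8")


for sample in ("Test123", "", "0042", "héllo"):
    assert int_to_str(str_to_int(sample)) == sample

# Caveat: leading NUL characters are dropped, because leading zero bytes
# do not change the value of a big-endian integer.
print(repr(int_to_str(str_to_int("\x00abc"))))  # 'abc'
```

For purely alphanumeric input the caveat never applies, since no letter or digit encodes to a zero byte.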

This encoder uses binascii to produce an integer encoding identical to the one in the answer by charel-f. I believe it to be identical because I tested it extensively.

Credit: this answer.

from binascii import hexlify, unhexlify

class BytesIntEncoder:

    @staticmethod
    def encode(b: bytes) -> int:
        return int(hexlify(b), 16) if b != b'' else 0

    @staticmethod
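    def decode(i: int) -> bytes:
        if i == 0:
            return b''
        hex_digits = '%x' % i
        # unhexlify() rejects odd-length input, so restore a dropped leading zero.
        if len(hex_digits) % 2:
            hex_digits = '0' + hex_digits
        return unhexlify(hex_digits)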

The type annotations are optional and may be removed; function annotations are valid syntax in every Python 3 release.

Quick test:

>>> s = 'Test123'
>>> b = s.encode()
>>> b
b'Test123'

>>> BytesIntEncoder.encode(b)
23755444588720691
>>> BytesIntEncoder.decode(_)
b'Test123'
>>> _.decode()
'Test123'
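
The claim that the binascii route matches int.from_bytes can be spot-checked with random inputs. The following is a sketch rather than a proof: the function names are hypothetical, the seed and sample count are arbitrary, and a leading zero digit is padded back before calling unhexlify because it rejects odd-length input.

```python
import random
from binascii import hexlify, unhexlify


def encode_via_hex(b: bytes) -> int:
    return int(hexlify(b), 16) if b else 0


def decode_via_hex(i: int) -> bytes:
    if i == 0:
        return b""
    hex_digits = "%x" % i
    if len(hex_digits) % 2:  # unhexlify() needs an even number of digits
        hex_digits = "0" + hex_digits
    return unhexlify(hex_digits)


rng = random.Random(0)  # fixed seed so the check is repeatable
for _ in range(1000):
    # Bytes are drawn from 1..255: a leading zero byte would be lost by either encoder.
    data = bytes(rng.randrange(1, 256) for _ in range(rng.randrange(20)))
    as_int = encode_via_hex(data)
    assert as_int == int.from_bytes(data, byteorder="big")
    assert decode_via_hex(as_int) == data
```

If the loop finishes without an AssertionError, both encoders agreed on every sampled input.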

Assuming the character set is purely alphanumeric, i.e. a-z, A-Z and 0-9 (62 characters), 6 bits per character are enough. Using an 8-bit byte encoding is therefore, in theory, an inefficient use of memory.

This answer converts the input bytes into a sequence of 6-bit integers and packs these small integers into one large integer using bitwise operations. Whether this actually translates into real-world storage savings can be measured with sys.getsizeof, and a saving is more likely for larger strings.

This implementation customizes the encoding to the chosen character set. For example, if you were working with just string.ascii_lowercase (5 bits) rather than string.ascii_uppercase + string.digits (6 bits), the encoding would be correspondingly more efficient.

Unit tests are also included.
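
The bit counts quoted above can be reproduced with a few lines. This sketch assumes the same rule the implementation in this answer uses, (num_chars + 1).bit_length(); the helper name bits_per_char is illustrative only.

```python
import string


def bits_per_char(alphabet: str) -> int:
    # Mirrors the implementation's rule: codes start at 1 and the width is
    # taken from (number of characters + 1).
    return (len(alphabet) + 1).bit_length()


for name, alphabet in [
    ("ascii_lowercase", string.ascii_lowercase),
    ("ascii_uppercase + digits", string.ascii_uppercase + string.digits),
    ("ascii_letters + digits", string.ascii_letters + string.digits),
]:
    bits = bits_per_char(alphabet)
    print(f"{name}: {len(alphabet)} chars, {bits} bits each, "
          f"{7 * bits} bits for 7 chars (vs {7 * 8} with 8-bit bytes)")
```

For a 7-character string this comes to 35 or 42 bits instead of 56, which is the theoretical saving the answer refers to.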

import string


class BytesIntEncoder:

    def __init__(self, chars: bytes = (string.ascii_letters + string.digits).encode()):
        num_chars = len(chars)
        # Map each allowed byte to a small code in 1..num_chars. Code 0 is never
        # used, so the last character always contributes a set bit.
        translation = bytes(range(1, num_chars + 1))
        self._translation_table = bytes.maketrans(chars, translation)
        self._reverse_translation_table = bytes.maketrans(translation, chars)
        self._num_bits_per_char = (num_chars + 1).bit_length()

    def encode(self, chars: bytes) -> int:
        num_bits_per_char = self._num_bits_per_char
        output, bit_idx = 0, 0
        for chr_idx in chars.translate(self._translation_table):
            output |= (chr_idx << bit_idx)
            bit_idx += num_bits_per_char
        return output

    def decode(self, i: int) -> bytes:
        maxint = (2 ** self._num_bits_per_char) - 1
        output = bytes(((i >> offset) & maxint) for offset in range(0, i.bit_length(), self._num_bits_per_char))
        return output.translate(self._reverse_translation_table)


# Test
import itertools
import random
import unittest


class TestBytesIntEncoder(unittest.TestCase):

    chars = string.ascii_letters + string.digits
    encoder = BytesIntEncoder(chars.encode())

    def _test_encoding(self, b_in: bytes):
        i = self.encoder.encode(b_in)
        self.assertIsInstance(i, int)
        b_out = self.encoder.decode(i)
        self.assertIsInstance(b_out, bytes)
        self.assertEqual(b_in, b_out)
        # print(b_in, i)

    def test_thoroughly_with_small_str(self):
        for s_len in range(4):
            for s in itertools.combinations_with_replacement(self.chars, s_len):
                s = ''.join(s)
                b_in = s.encode()
                self._test_encoding(b_in)

    def test_randomly_with_large_str(self):
        for s_len in range(256):
            num_samples = {s_len <= 16: 2 ** s_len,
                           16 < s_len <= 32: s_len ** 2,
                           s_len > 32: s_len * 2,
                           s_len > 64: s_len,
                           s_len > 128: 2}[True]
            # print(s_len, num_samples)
            for _ in range(num_samples):
                b_in = ''.join(random.choices(self.chars, k=s_len)).encode()
                self._test_encoding(b_in)


if __name__ == '__main__':
    unittest.main()

Usage example:

>>> encoder = BytesIntEncoder()
>>> s = 'Test123'
>>> b = s.encode()
>>> b
b'Test123'

>>> encoder.encode(b)
3908257788270
>>> encoder.decode(_)
b'Test123'
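
The integer 3908257788270 can be traced by hand. The sketch below repeats the packing outside the class, assuming the default alphabet (string.ascii_letters + string.digits) and 6 bits per character; the variable names are illustrative.

```python
import string

alphabet = string.ascii_letters + string.digits
bits = 6  # (62 + 1).bit_length()

# Each character becomes its 1-based position in the alphabet.
codes = [alphabet.index(c) + 1 for c in "Test123"]
print(codes)  # [46, 5, 19, 20, 54, 55, 56]

# The first character lands in the lowest 6 bits, the next one 6 bits higher, and so on.
packed = 0
for position, code in enumerate(codes):
    packed |= code << (bits * position)
print(packed)  # 3908257788270

# Decoding peels the 6-bit groups off again, lowest first.
mask = (1 << bits) - 1
decoded = "".join(
    alphabet[((packed >> offset) & mask) - 1]
    for offset in range(0, packed.bit_length(), bits)
)
print(decoded)  # Test123
```

Because no code is ever 0, the highest 6-bit group is always non-zero, so bit_length() tells the decoder exactly how many groups to read.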

I needed to transfer a dictionary as a number. It may look a bit ugly, but it is efficient in that every character of the JSON text (English letters included) becomes exactly two digits, and it can still carry any Unicode character, because json.dumps escapes non-ASCII characters by default.

import json

myDict = {
    "le key": "le Valueue",
    2 : {
        "heya": 1234569,
        "3": 4
    },
    'Α α, Β β, Γ γ' : 'שלום'
}
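def convertDictToNum(toBeConverted):
    # Serialize to JSON (non-ASCII is escaped to \uXXXX by default), then write
    # each character as ord(c) - 26, zero-padded to two digits.
    # Caveat: '~' (ord 126) would give the three-digit code 100 and break decoding.
    return int(''.join('%02d' % (ord(c) - 26) for c in json.dumps(toBeConverted)))

def loadDictFromNum(toBeDecoded):
    toBeDecoded = str(toBeDecoded)
    # Read the digits back two at a time and undo the offset of 26.
    return json.loads(''.join(chr(int(toBeDecoded[cut:cut + 2]) + 26) for cut in range(0, len(toBeDecoded), 2)))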

numbersDict = convertDictToNum(myDict)
print(numbersDict)
# 9708827506817595083206088....
recoveredDict = loadDictFromNum(numbersDict)
print(recoveredDict)
# {'le key': 'le Valueue', '2': {'heya': 1234569, '3': 4}, 'Α α, Β β, Γ γ': 'שלום'}
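
As a possible alternative, the JSON text can be pushed through the bytes <--> int conversion used elsewhere on this page instead of a two-digit code per character. This is a different technique from the one above, shown only as a sketch; the function names dict_to_int and int_to_dict are hypothetical.

```python
import json


def dict_to_int(obj) -> int:
    # ensure_ascii=False keeps non-ASCII text as real characters; UTF-8 then encodes them.
    raw = json.dumps(obj, ensure_ascii=False).encode("utf-8")
    return int.from_bytes(raw, byteorder="big")


def int_to_dict(number: int):
    raw = number.to_bytes((number.bit_length() + 7) // 8, byteorder="big")
    return json.loads(raw.decode("utf-8"))


sample = {"a~b": [1, 2], "greeting": "שלום", "nested": {"3": 4}}
assert int_to_dict(dict_to_int(sample)) == sample
```

The JSON text of a dictionary always starts with "{", so no leading zero byte can be lost. As with any JSON round trip, non-string keys such as 2 come back as strings.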

I wanted to do the same and came up with my own algorithm. You can decide whether it is worth using. You will always get the same int for the same input. If you provide a string with the same characters in a different order, you will get a different result.

import hashlib

string="example"
str_as_sha1hash = hashlib.sha1(string.encode()).hexdigest()
result = 0
for idx, char in enumerate(str_as_sha1hash):
    result = result + ord(char) * (idx + 1)
print(result)

You should get 61071 for the word 'example'. If you try 'exampel', you should get 55095. If you think you need a stronger hashing algorithm than SHA-1, you can replace it with anything available in the hashlib library. Finally, if you need a str instead of an int, you can of course call str(result).
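
One property worth checking before relying on this approach: the weighted sum is computed from a 40-character hex digest, so its value is confined to a narrow range. The sketch below wraps the loop in a function (the name sha1_weighted_sum is illustrative) and works out that range.

```python
import hashlib


def sha1_weighted_sum(text: str) -> int:
    digest = hashlib.sha1(text.encode()).hexdigest()  # always 40 hex characters
    return sum(ord(ch) * (idx + 1) for idx, ch in enumerate(digest))


# Every digest character lies between '0' (ord 48) and 'f' (ord 102),
# and the position weights 1..40 add up to 820.
weights_total = sum(range(1, 41))
lower_bound = ord("0") * weights_total  # 39360
upper_bound = ord("f") * weights_total  # 83640

value = sha1_weighted_sum("example")
assert lower_bound <= value <= upper_bound
print(lower_bound, upper_bound)  # 39360 83640
```

With at most 44281 possible results, different strings must share the same integer, so this gives a repeatable fingerprint rather than an encoding that can be turned back into the original string.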
