
How can I inflate this zlib byte string in python?

I'm writing a tool to interact with a popular data warehouse SaaS. Their online SQL editor serializes SQL worksheets to JSON, but the body of the SQL worksheet is zlib-deflated using pako.js. I'm trying to read and inflate these zlib strings from Python, but I can only decode bytestrings that contain short inputs.

An example where the SQL text was just the letter a:

bytestring = b'x\xef\xbf\xbdK\x04\x00\x00b\x00b\n'
zlib.decompress(bytestring[4:-4], -15).decode('utf-8')
>>> "a"

If I include a semicolon (a;), this fails to decompress:

bytestring = b'x\xef\xbf\xbdK\xef\xbf\xbd\x06\x00\x00\xef\xbf\xbd\x00\xef\xbf\xbd\n'
zlib.decompress(bytestring[4:-4], -15).decode('utf-8')
*** UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8f in position 1: invalid start byte

Note: I've also tried decoding these examples with 'punycode', which I found references to in the JavaScript implementation.

My understanding of zlib is pretty limited, but I've picked up that the first two and last four bytes of a zlib string are the header/footer and can be trimmed if we run zlib with the magic number -15. It's entirely possible there is a zlib magic number that would decompress these strings without needing to strip the header and footer, but I wasn't able to get any combination to work when looping from -64 to 64.
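For reference, the "magic number" here is zlib's wbits parameter, which controls how the header and trailer are handled. A minimal sketch of the common values, using a known-good stream (zlib.compress(b"a")) rather than the corrupted worksheet data:

```python
import zlib

# A known-good zlib stream: 2-byte header, deflate data, 4-byte adler32 trailer.
data = zlib.compress(b"a")

# wbits = 15 expects a complete zlib stream (header + trailer).
assert zlib.decompress(data, 15) == b"a"

# wbits = -15 expects a raw deflate stream, so the 2-byte header and
# 4-byte checksum must be stripped first (not 4 bytes on each side).
assert zlib.decompress(data[2:-4], -15) == b"a"

# wbits = 47 (32 + 15) auto-detects zlib or gzip headers.
assert zlib.decompress(data, 47) == b"a"
```

Note the slicing: for an intact stream the header is 2 bytes, not 4 — the [4:-4] slice in the question only works because the corrupted header byte has been expanded to three bytes.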

I've stepped through the online SQL worksheet editor's save and load functions with breakpoints and found they are using the pako zlib library: pako.deflate(a, {to: 'string'}) and pako.inflate(b['body'], {to: 'string'}). I'm able to inflate/deflate SQL strings in the browser using the pako library, but haven't been able to reproduce the same results in Python.

I agree that this is a data corruption issue. zlib and pako should be able to read one another's data without any stripping of fields or adding of magic numbers.

To prove it, here are a couple of demo scripts I threw together, one using pako to deflate the data and one using zlib to inflate it again:

// deflate.js
var pako = require("./pako.js");
console.log(escape(pako.deflate(process.argv[2], {to: "string"})));
# inflate.py
import urllib.parse, zlib, sys
print(zlib.decompress(urllib.parse.unquote_to_bytes(sys.stdin.read())).decode("utf-8"))

Run them on the command line using node deflate.js "Here is some example text" | python3 inflate.py. The expected output is the argument passed to node deflate.js.

One thing worth pointing out about pako is the behaviour when using the to: "string" option. The documentation for this option is as follows:

to (String) - if equal to 'string', then result will be "binary string" (each char code [0..255])

It is for this reason that I use escape in the JavaScript script above. Using escape ensures that the string passed between JavaScript and Python doesn't contain any non-ASCII characters. (Note that encodeURIComponent does not work because the string contains binary data.) I then use urllib.parse.unquote_to_bytes in Python to undo this escaping.

If you can escape the pako-deflated data in the browser you could potentially pass that to Python to inflate it again.

Each sequence of \xef\xbf\xbd represents an instance of corruption of the original data.

In your first example, the first and only \xef\xbf\xbd should be a single byte, \x9c, which is the second byte of the zlib header. In the second example, the first \xef\xbf\xbd should again be the second byte of the zlib header, the second instance should be \xb4, the third instance should be \xff, and the fourth instance should be \x9d.

Somewhere along the way there is some UTF-8 processing that should not be happening. It fails every time it comes across a byte with the high bit set. In those instances, it replaces the byte with the three-byte UTF-8 encoding of U+FFFD, the "replacement" character used to represent an unknown character.
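This is easy to reproduce in Python by mimicking whatever lossy UTF-8 decode happened upstream: round-tripping the good stream through errors="replace" yields exactly the first bytestring in the question.

```python
import zlib

# Any byte with the high bit set is not a valid UTF-8 start byte on its
# own; decoding with errors="replace" swaps it for U+FFFD, which
# re-encodes as the three bytes EF BF BD seen in the question.
assert b"\x9c".decode("utf-8", errors="replace").encode("utf-8") == b"\xef\xbf\xbd"

# Applying the same lossy round trip to the intact stream for "a"
# reproduces the corrupted bytestring from the question exactly.
good = zlib.compress(b"a") + b"\n"          # b'x\x9cK\x04\x00\x00b\x00b\n'
bad = good.decode("utf-8", errors="replace").encode("utf-8")
assert bad == b"x\xef\xbf\xbdK\x04\x00\x00b\x00b\n"
```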

The bottom line is that your data is irretrievably corrupted. You need to fix whatever is going wrong upstream of there. Are you trying to use copy-and-paste to get the data? If you see a question mark in a black diamond, that is the U+FFFD replacement character.
