
Structure of base64-encoded strings in HTML

Original post 2020-06-11 06:48:15 · Tags: python, selenium, base64

I downloaded the page source (HTML) of several websites with Selenium (Python), and I want to find all base64-encoded strings in the HTML files.

Is there a known structure shared by all base64-encoded strings in HTML? From my observations, such a string seems to start with ;base64 followed by a run of encoded characters and finally a closing bracket ) . Is that accurate?

According to Wikipedia, the encoded string must also consist only of the following characters: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ . Can someone confirm that as well?

Thanks a lot in advance!

* Edit 1 *

Thanks a lot, Tris. The link you provided is very helpful. However, from that, it seems there is no specific format for the end of a base64 string. If I want to detect its end, what would you advise looking for other than ) ?

I mainly want to track changes on a set of websites, and the base64 payloads contain a lot of data that is not relevant for my use. To save storage, I therefore intend to remove them. An example is www.amd.com , whose source contains data:image/png;base64,... (after being rendered by the browser).

Since there are many different websites, I don't know all of their formats. Here are some other examples of base64 strings that I found and that are not useful to me:

data:font/truetype;base64,AAEAAA...

data:image/png;base64,iVBORw0KG...

Several of the examples I saw ended with a closing bracket ) . Under what circumstances do they end with ) , and when do they not?

Thanks again!

1 Answer

Not all base64-encoded strings have ;base64 in front of them; that prefix is specific to data URLs. If you are specifically looking for base64-encoded images or other inline resources that would otherwise be referenced by an HTTP URL, matching on it should be fine. The closing bracket is not part of the encoding: I haven't seen it required by data URLs or by any other base64-encoded strings.
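As a rough illustration of that point, here is a minimal sketch of stripping base64 data URLs from page source with a regular expression. The payload is taken to end at the first character outside the base64 alphabet, so no explicit end marker is needed. In the examples from the question, the ) most likely belongs to an enclosing CSS url(...) rather than to the data URL itself. The names DATA_URL_RE and strip_base64_data_urls and the "data:," placeholder are hypothetical choices made for this example, and the pattern assumes the payload contains no whitespace or percent-encoding.

```python
import re

# Hypothetical pattern: "data:", an optional media type, optional
# ";name=value" parameters, then ";base64," and the payload.
# The payload stops at the first character outside A-Z a-z 0-9 + / =.
DATA_URL_RE = re.compile(
    r"data:(?:[\w.+-]+/[\w.+-]+)?(?:;[\w-]+=[\w.+-]+)*;base64,"
    r"[A-Za-z0-9+/]+={0,2}"
)


def strip_base64_data_urls(html: str, placeholder: str = "data:,") -> str:
    """Replace every base64 data URL in html with a short placeholder."""
    return DATA_URL_RE.sub(placeholder, html)


sample = (
    '<div style="background:url(data:image/png;base64,iVBORw0KGgo=)"></div>'
    '<img src="data:image/gif;base64,R0lGODlhAQABAAAAACw=">'
)
print(strip_base64_data_urls(sample))
```

The same function could be applied to driver.page_source from Selenium before the HTML is stored. Note that the ) and the closing quote are left in place because they are not part of the match.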

Typically, base64-encoded strings use the alphabet you've mentioned: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ . If the length of the input data is not a multiple of 3 bytes, the encoded output is padded with one or two = characters at the end, so that its length is always a multiple of 4.
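The padding rule can be checked with Python's standard base64 module. The sketch below encodes inputs of 3, 4, and 5 bytes and adds a small helper, looks_like_base64 (a hypothetical name for this example), that relies on strict decoding to reject characters outside the alphabet and incorrect padding.

```python
import base64
import binascii

# 3 bytes need no padding; 4 bytes need "=="; 5 bytes need "=".
encoded = {
    raw: base64.b64encode(raw).decode("ascii")
    for raw in (b"abc", b"abcd", b"abcde")
}
for raw, text in encoded.items():
    print(len(raw), text, len(text))


def looks_like_base64(candidate: str) -> bool:
    """Return True if candidate decodes as strict standard base64."""
    try:
        # validate=True makes non-alphabet characters an error
        # instead of silently discarding them.
        base64.b64decode(candidate, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True
```

A successful strict decode only shows that a string is well-formed base64; ordinary words such as "abcd" pass too, so this check is more useful next to a marker like ;base64, than on its own.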

There is another commonly used base64 format on the web: the URL-safe base64 format. In this encoding, + and / are replaced with - and _ so that the result can be included in URLs safely, hence the name.
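To make the difference concrete, this short sketch encodes the same three bytes with both alphabets using the standard library. The input bytes are chosen for this example only because they produce + and / in the standard encoding.

```python
import base64

# Three bytes whose 6-bit groups map to indices 62, 63, 63, 62.
raw = bytes([0xFB, 0xFF, 0xFE])

standard = base64.b64encode(raw).decode("ascii")
url_safe = base64.urlsafe_b64encode(raw).decode("ascii")
print(standard, url_safe)

# Converting between the two alphabets is a plain character swap.
as_standard = url_safe.translate(str.maketrans("-_", "+/"))
```

A pattern meant to catch both variants would need - and _ in its character class in addition to + and / .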

This information may be irrelevant if you know more about the structure of the websites you are trying to parse than just "they contain base64-encoded string data."