Structure of base64-encoded strings in HTML

I downloaded the page source (HTML) of several websites with Selenium (Python), and I want to find all base64-encoded strings in those HTML files.

Is there a known structure that all base64-encoded strings in HTML follow? From my observations, such a string seems to start with ;base64 , followed by hex-strings, and finally a closing bracket ) . Is that accurate?

According to Wikipedia, the hex-string must also be composed only of the following characters: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ . Can someone confirm that as well?

Thanks a lot in advance!

* Edit 1 *

Thanks a lot, Tris. The link you provided is very helpful. However, from that, it seems there is no specific format for the end of a base64 string. If I want to detect where it ends, what advice would you give other than looking for ) ?

I mainly want to track changes across a bunch of websites, and the base64 encodings contain a lot of data that is not relevant for my use. To save storage, I therefore intend to remove them. An example is www.amd.com , which contains the following data:image/png;base64,... (after being rendered by the browser).

Since there are many different websites, I don't know all of their formats. Here are some other examples of base64 strings that I found and that are not useful to me:

data:font/truetype;base64,AAEAAA...

...

Several of the examples that I saw ended with a closing bracket ) . May I ask in which scenarios they end with ) and in which they do not?

Thanks again!

Not all base64-encoded strings will include a ;base64 at the beginning of them -- this is typically specific to data URLs . If you are specifically looking for base64-encoded images or other inline elements that would otherwise be referred to with an HTTP URL, this might be fine. The closing bracket is not typically relevant; I haven't seen it required on data URLs or other base64-encoded strings.
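Since the goal is to drop inline data URLs to save storage, here is a minimal sketch of one way to do that with a regular expression. It assumes the payload uses the standard base64 alphabet and is not broken up by whitespace or percent-encoding. The names DATA_URL_RE and strip_base64_data_urls are made up for this example. The match simply stops at the first character outside the alphabet, so whatever delimiter follows is left alone. The ) you noticed is most likely the end of a CSS url(...) wrapper rather than part of the data URL; inside an HTML attribute the next character would be a closing quote instead.

```python
import re

# "data:" + optional media type/parameters + ";base64," + payload.
# The payload match stops at the first character outside the standard
# base64 alphabet, so a closing quote or ")" is never consumed.
DATA_URL_RE = re.compile(r"data:[^,\s\"'()<>]*;base64,[A-Za-z0-9+/]*={0,2}")


def strip_base64_data_urls(html: str, placeholder: str = "") -> str:
    """Replace every base64 data URL in html with placeholder (hypothetical helper)."""
    # A function is used as the replacement so the placeholder is inserted literally.
    return DATA_URL_RE.sub(lambda _match: placeholder, html)


sample = (
    '<img src="data:image/png;base64,iVBORw0KGgo=">'
    "<style>@font-face{src:url(data:font/truetype;base64,AAEAAA==)}</style>"
)
print(strip_base64_data_urls(sample))
# <img src=""><style>@font-face{src:url()}</style>
```

Data URLs without ;base64 (for example data:text/plain,hello) are left untouched by this pattern, which may or may not be what you want.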

Typically, base64-encoded strings use the alphabet you've mentioned -- ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ . Every 3 bytes of input become 4 output characters, so if the input length is not a multiple of 3 bytes, the output is padded with one or two = characters at the end, which makes the encoded length a multiple of 4.
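To see the padding rule in practice, here is a small sketch using Python's standard base64 module. The helper name looks_like_base64 is hypothetical. Keep in mind that passing this check only means a string is well-formed base64; plenty of ordinary words (such as "abcd") pass too, so it cannot tell you whether something was actually meant to be base64.

```python
import base64
import binascii

# 3, 2 and 1 input bytes give 0, 1 and 2 "=" padding characters.
for raw in (b"abc", b"ab", b"a"):
    print(len(raw), base64.b64encode(raw).decode("ascii"))
# 3 YWJj
# 2 YWI=
# 1 YQ==


def looks_like_base64(text: str) -> bool:
    """Return True if text is well-formed standard base64 (hypothetical helper)."""
    try:
        # validate=True rejects characters outside the alphabet instead of
        # silently discarding them.
        base64.b64decode(text, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True
```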

There is another commonly used base64 format on the web -- the URL-safe base64 format. In this encoding, + and / are typically replaced with - and _ so they can be included in URLs safely, hence the name.
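As a quick illustration of the difference between the two alphabets, the sketch below encodes the same three bytes both ways with the standard base64 module. The input bytes were picked only because they happen to produce + and / in the standard encoding.

```python
import base64

raw = b"\xfb\xff\xfe"  # chosen so that the output contains "+" and "/"

standard = base64.b64encode(raw).decode("ascii")
url_safe = base64.urlsafe_b64encode(raw).decode("ascii")

print(standard)  # +//+
print(url_safe)  # -__-

# Both forms decode back to the same bytes with the matching decoder.
assert base64.b64decode(standard) == raw
assert base64.urlsafe_b64decode(url_safe) == raw
```

If you need to catch this variant as well, a character class such as [A-Za-z0-9_-] would have to be used in place of the standard one. URL-safe strings are also often written without the trailing = padding.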

This information may be irrelevant if you know more about the structure of the websites you are trying to parse, aside from just "they contain base64-encoded string data."

