Decode Base64 until there is no Base64

Question

My problem is, I think, very simple. I need to decode Base64 until there is no Base64 left. I check with a regex whether there is some Base64, but I have no idea how to keep decoding until no Base64 remains.

In this short code I can decode the Base64 until there is no Base64 left, because my text is defined in advance (the loop keeps decoding until the result is "Hello World").

# Import Libraries
from base64 import b64decode
import re

# Text & Base64 String
strText = "Hello World"
strEncode = "VmxSQ2ExWXlUWGxUYTJoUVVqSlNXRlJYY0hOT1ZteHlXa1pLVVZWWE9EbERaejA5Q2c9PQo=".encode("utf-8")

# Decode
objRgx = re.search('^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$', strEncode.decode("utf-8"))

strDecode = b64decode(objRgx.group(0).encode("utf-8"))

print(strDecode.decode("utf-8"))

while strDecode != strText.encode("utf-8"):
    strDecode = b64decode(strDecode)

    print(strDecode.decode("utf-8"))

Does anyone have an idea how I can decode the Base64 until I reach the real text (no more Base64)?

PS: Sorry for my bad English.

Answer 1

You can't, not in an arbitrary sense. The problem is simply that normal, everyday words can ALSO be valid BASE64. So there's no real way to tell the difference between the two.

BASE64 doesn't have a terminator other than length. It CAN be terminated with = or ==, but it does not HAVE to be. The = characters are just padding: if no padding is needed, there is no =. So it's possible that the BASE64 will end and some text will begin without you being able to detect it.
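To make the ambiguity concrete, here is a minimal sketch using only Python's standard `base64` module. The helper name `is_decodable` is made up for illustration. It shows that a strict decoder happily accepts some ordinary words, so "it decodes without error" does not prove that something was Base64 to begin with.

```python
import base64
import binascii


def is_decodable(text: str) -> bool:
    """Return True if `text` passes strict Base64 decoding."""
    try:
        # validate=True rejects characters outside the Base64 alphabet
        base64.b64decode(text, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


print(is_decodable("word"))         # an everyday word, 4 characters long
print(is_decodable("testcase"))     # 8 characters, letters only
print(is_decodable("Hello World"))  # contains a space, so it is rejected
```

The first two calls succeed even though the inputs are plain English; only the string with a space is rejected.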

Edit for "So there is really no way to do what I want?":

No, not deterministically, not reliably. Even with a heuristic, there will be potential cases where it fails and you end up consuming too many characters, resulting in garbage at the end of your binary block and loss of characters in the text stream that follows.

Now, this is for an arbitrary BASE64 block. If you KNOW what the binary data is, then perhaps there's hope.

For example, if you KNOW what the binary data is, most binary formats "know" when they are "done". I don't know of a valid binary format that says "read until you reach EOF". They're typically laced with internal descriptors saying "this is how much data the next chunk has" or with terminators saying "I'm done".

In these cases you can treat the BASE64 as a stream. BASE64 is basically pretty simple. It takes 3 bytes and converts them into 4 characters.

So a B64 stream reader simply needs to read 4 characters and return the 3 bytes they represent.

If you have, say, a PNG reader, it can start reading the converted stream. When it is "done", it "closes" the stream, and your original text is "at the end of the BASE64".

It can also work if you know the size of the original attachment. If someone sent "10,000 bytes", then you use your BASE64 stream decoder and simply read 10,000 bytes from it.

More often than not, you will have BASE64 with a = or == terminator. It's the cases where you don't that are the problem. The stream decoder works either way.

If you don't know the original size of the attachment, or the format of the encoded binary, then you're pretty much out of luck.
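A stream reader along those lines could look roughly like the sketch below. The generator name `iter_decoded_chunks` is hypothetical, and the sketch assumes the input holds no line breaks. Any trailing characters that do not form a full 4-character group are ignored.

```python
import base64
from typing import Iterator


def iter_decoded_chunks(text: str) -> Iterator[bytes]:
    """Yield decoded bytes one 4-character Base64 group at a time.

    Raises binascii.Error as soon as a group is not valid Base64.
    """
    for start in range(0, len(text) - 3, 4):
        group = text[start:start + 4]
        # A full group yields 3 bytes; a padded final group yields 1 or 2
        yield base64.b64decode(group, validate=True)


chunks = list(iter_decoded_chunks("SGVsbG8gV29ybGQ="))
print(chunks)
print(b"".join(chunks))
```

A consumer such as a format-aware reader can pull chunks from the generator and simply stop when its format says the data is complete.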
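When the original size is known, the split point can be computed directly, since every 3 bytes occupy 4 characters. The following is only a sketch: `split_known_size` is a made-up helper, and it assumes the Base64 is padded with = and contains no line breaks.

```python
import base64
import math
from typing import Tuple


def split_known_size(text: str, size: int) -> Tuple[bytes, str]:
    """Decode `size` bytes of padded Base64 from the front of `text`.

    Returns the decoded payload and whatever text follows it.
    """
    # Each group of up to 3 bytes is encoded as exactly 4 characters
    encoded_len = 4 * math.ceil(size / 3)
    payload = base64.b64decode(text[:encoded_len], validate=True)
    return payload[:size], text[encoded_len:]


payload, rest = split_known_size("SGVsbG8gV29ybGQ=and then plain text", 11)
print(payload)
print(rest)
```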

Answer 2

As a heuristic, you could compute the average word length in the result. Natural language will have short words like "As a heuristic, you could look at word length." A string that is still Base64 encoded will have few if any spaces, and long strings between the spaces.

As another heuristic, you could calculate the proportion of vowels (a, e, i, o, u) to consonants, or the number of capital letters in the middle of words.
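A rough sketch of the word-length idea follows. The function names and the cutoff of 10 characters are assumptions picked for illustration, not tested values, and like any heuristic this will misjudge some inputs.

```python
def average_word_length(text: str) -> float:
    """Average length of the whitespace-separated words in `text`."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


def looks_like_natural_text(text: str, max_average: float = 10.0) -> bool:
    """Guess whether `text` is natural language (threshold is arbitrary)."""
    return 0 < average_word_length(text) <= max_average


print(average_word_length("Hello World"))
print(average_word_length("SGVsbG8gV29ybGQ="))
print(looks_like_natural_text("SGVsbG8gV29ybGQ="))
```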
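Both measurements are easy to compute. The sketch below uses made-up helper names and leaves it open what thresholds to apply, since those depend on the language and kind of text you expect.

```python
def vowel_ratio(text: str) -> float:
    """Fraction of the letters in `text` that are vowels (a, e, i, o, u)."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    vowels = sum(1 for ch in letters if ch in "aeiou")
    return vowels / len(letters)


def inner_capitals(text: str) -> int:
    """Count capital letters that are not the first character of a word."""
    return sum(
        1
        for word in text.split()
        for ch in word[1:]
        if ch.isupper()
    )


for sample in ["Hello World", "SGVsbG8gV29ybGQ="]:
    print(sample, vowel_ratio(sample), inner_capitals(sample))
```

Encoded data tends to show few vowels and many capitals in the middle of "words", while ordinary prose shows the opposite.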

Answer 3

So you're dealing with a block of data that may have been repeatedly base64-encoded? Why not just loop the string through b64decode() until it errors, then?

Also, I think you probably don't need to sprinkle quite so many .encode("utf-8") calls around.
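One way to write that loop is sketched below. The function name `decode_until_error` and the `max_rounds` safety limit are assumptions added for illustration. Strict validation is used so that text containing spaces or punctuation stops the loop. Keep in mind that plain text which happens to be valid Base64 would still be decoded one round too far.

```python
import base64
import binascii


def decode_until_error(data: bytes, max_rounds: int = 50) -> bytes:
    """Repeatedly Base64-decode `data` until decoding fails."""
    current = data
    for _ in range(max_rounds):
        # Drop line breaks, which encoders often add to their output
        candidate = current.replace(b"\n", b"").replace(b"\r", b"")
        if not candidate:
            break
        try:
            decoded = base64.b64decode(candidate, validate=True)
        except binascii.Error:
            break  # not valid Base64 any more, so keep the last result
        current = decoded
    return current


nested = b"Hello World"
for _ in range(3):
    nested = base64.b64encode(nested)

print(decode_until_error(nested))
```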

Answer 4

I see two valuable answers here, referring to average word length (Mark Lutton) and byte size of the original data (Will Hartung). Another useful thing: look for expected dictionary words, or meaningful numbers and/or dates.
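If you know some words that the real text should contain, a check could be as simple as the sketch below. The helper name and the word list are placeholders for illustration only.

```python
from typing import Iterable


def contains_expected_words(text: str, expected: Iterable[str]) -> bool:
    """Return True if any expected word appears as a whole word in `text`."""
    words = {word.strip(".,;:!?\"'()").lower() for word in text.split()}
    return any(item.lower() in words for item in expected)


expected_words = ["hello", "world"]  # placeholder vocabulary
print(contains_expected_words("Hello World", expected_words))
print(contains_expected_words("SGVsbG8gV29ybGQ=", expected_words))
```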

4 answers

Answer 1: 6, 2010-10-22 15:29:33
Answer 2: 2, 2010-10-22 15:27:06
Answer 3: 0, 2010-10-22 15:41:34
Answer 4: 0
