
Python: Decoding base64 encoded strings within an HTML file and replacing these strings with their decoded counterpart

Please help because this flipping program is my ongoing nightmare!

I have several files that include some base64 encoded strings. Part of one file, for example, reads as follows:

charset=utf-8;base64,I2JhY2tydW5uZXJfUV81c3R7aGVpZ2h0OjkzcHg7fWJhY2tydW5uZXJfUV81c3R7ZGlzcGxheTpibG9jayFpbXBvcnRhbnQ7fQ==" 

They are always in the format "ANYTHINGbase64,STRING". It is HTML, but I am treating it as one large string and using BeautifulSoup elsewhere. I am using a regular expression, 'base', to extract the base64 string, then using the base64 module to decode it in my function "debase".

This seems to work OK up to a point: for some reason the output of b64decode has unnecessary stuff added:

b'#backrunner_Q_5st{height:93px;}backrunner_Q_5st{display:block!important;}', with the string I want being the stuff in the middle.

I'm guessing this means it is in bytes, so I have tried getting my function to encode this as UTF-8, but basically I am out of my depth.

The end result that I want is for every "base64,STRING" in my HTML to be decoded and replaced with DECODEDSTRING.

Please help!

import os, sys, bs4, re, base64, codecs
from bs4 import BeautifulSoup

def debase(instr):
    outstring = base64.b64decode(instr)
    outstring = codecs.utf_8_encode(str(outstring))
    outstring.split("'")[1]
    return outstring

base = re.compile('base64,(.*?)"')

for eachArg in sys.argv[1:]:
    a=open(eachArg,'r',encoding='utf8')
    presoup = a.read()
    b = re.findall(base, presoup)
    for value in b:
        re.sub('base64,.*?"', debase(value))
        print(debase(value))


    soup=BeautifulSoup(presoup, 'lxml')
    bname= str(eachArg).split('.')[0]
    a.close()
    [s.extract() for s in soup('script')]
    os.remove(eachArg)
    b=open(bname +'.html','w',encoding='utf8')
    b.write(soup.prettify())
    b.close()

Your input is a bit oddly formatted (with a trailing unmatched quote, for instance), so make sure you're not doing unnecessary work or parsing content in a weird way.

Anyway, assuming your input has the form you showed, you have to decode it with base64 the way you already did, then decode the resulting bytes using the given charset to get a string rather than a bytestring:

import base64

inp = 'charset=utf-8;base64,I2JhY2tydW5uZXJfUV81c3R7aGVpZ2h0OjkzcHg7fWJhY2tydW5uZXJfUV81c3R7ZGlzcGxheTpibG9jayFpbXBvcnRhbnQ7fQ=="'
head,tail = inp.split(';')
_,enc = head.split('=') # TODO: check if the beginning is "charset"
_,msg = tail.split(',') # TODO: check that the beginning is "base64"

plaintext_bytes = base64.b64decode(msg)
plaintext_str = plaintext_bytes.decode(enc)

Now the two results are

>>> plaintext_bytes
b'#backrunner_Q_5st{height:93px;}backrunner_Q_5st{display:block!important;}'
>>> plaintext_str
'#backrunner_Q_5st{height:93px;}backrunner_Q_5st{display:block!important;}'

As you can see, the content of the bytes was already readable; this is because the contents were ASCII. Also note that I didn't remove the trailing quote from your string: base64.b64decode, in its default non-validating mode, ignores what comes after the two equals signs in the content.
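Both points can be checked with a small sketch. The text "café" and the variable names are illustrative assumptions, not part of the question; the example only shows how the bytes and the string differ once non-ASCII characters appear, and how a trailing quote is treated with and without validate=True.

```python
import base64
import binascii

# Illustrative non-ASCII text (an assumption, not taken from the question).
encoded = base64.b64encode("café".encode("utf-8")).decode("ascii")

raw = base64.b64decode(encoded)
print(raw)                  # b'caf\xc3\xa9'  (bytes repr shows escapes)
print(raw.decode("utf-8"))  # café            (decoded str is readable)

# Default mode: a stray trailing quote does not change the result.
with_quote = encoded + '"'
print(base64.b64decode(with_quote) == raw)

# Strict mode: the same input is rejected with binascii.Error.
try:
    base64.b64decode(with_quote, validate=True)
    strict_ok = True
except binascii.Error as exc:
    strict_ok = False
    print("rejected in strict mode:", exc)
```

Relying on the lenient default is convenient here, but stripping the quote before decoding (or matching only base64 characters in the regex) makes the intent explicit.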


In a nutshell, strings are a somewhat abstract representation of text in Python 3, and you need a specific encoding if you want to represent the text as a stream of ones and zeros (which you need when you transfer data from one place to another). When you get a string as bytes, you have to know how it was encoded in order to decode it and obtain a proper string. If the string is ASCII-compatible then the encoding is fairly trivial, but once more general characters appear your code will break if you use the wrong encoding.
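To get from a single decoded string to the end result asked for in the question (every "base64,STRING" replaced by its decoded text), one option is re.sub with a function as the replacement. The following is a minimal sketch, not a definitive implementation. The names B64_RE, decode_match and decode_embedded_base64 are hypothetical, the pattern assumes the payload consists only of standard base64 characters, and UTF-8 is assumed whenever no "charset=NAME;" prefix directly precedes "base64,".

```python
import base64
import re

# Assumed pattern: an optional "charset=NAME;" prefix, then "base64,PAYLOAD".
# The payload stops at the first character outside the base64 alphabet,
# so a closing quote is never part of the match.
B64_RE = re.compile(
    r'(?P<prefix>charset=(?P<charset>[\w-]+);)?'
    r'base64,(?P<payload>[A-Za-z0-9+/]+={0,2})'
)

def decode_match(match):
    """Return the replacement for one match, or the original text on failure."""
    charset = match.group("charset") or "utf-8"  # assumption: default to UTF-8
    try:
        decoded = base64.b64decode(match.group("payload")).decode(charset)
    except (ValueError, LookupError):
        # binascii.Error and UnicodeDecodeError are ValueError subclasses;
        # LookupError covers an unknown charset name.
        return match.group(0)
    # Keep the prefix so that only "base64,STRING" is replaced.
    return (match.group("prefix") or "") + decoded

def decode_embedded_base64(text):
    """Replace every "base64,STRING" in text with its decoded string."""
    return B64_RE.sub(decode_match, text)

sample = ('charset=utf-8;base64,I2JhY2tydW5uZXJfUV81c3R7aGVpZ2h0OjkzcHg7fWJhY2ty'
          'dW5uZXJfUV81c3R7ZGlzcGxheTpibG9jayFpbXBvcnRhbnQ7fQ=="')
print(decode_embedded_base64(sample))
```

In the question's loop this would be applied to the file contents before they are handed to BeautifulSoup, for example presoup = decode_embedded_base64(presoup). Returning the original text on failure is a design choice that keeps one malformed payload from aborting the whole file.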

Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you repost this content, please cite this site's URL or the original address.
