How to decode text with special symbols using base64 in python3?

I am trying to decode a list of strings using the base64 module. I can decode some of them, but not the ones that contain special symbols.

import base64

# List of string which we are trying to decode
encoded_text_list = ['MTA0MDI0','MTA0MDYw','MTA0MDgz','MTA0MzI%3D']
    
# Iterating and decoding string using base64    
for k in encoded_text_list:
    print(k, base64.b64decode(k).decode())

Output:

MTA0MDI0 104024
MTA0MDYw 104060
MTA0MDgz 104083

---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
<ipython-input-60-d1ba00f4e54a> in <module>
      2 for k in member_url_list:
      3     print(k)
----> 4     print(base64.b64decode(k).decode())
      5     # break

/usr/lib/python3.6/base64.py in b64decode(s, altchars, validate)
     85     if validate and not re.match(b'^[A-Za-z0-9+/]*={0,2}$', s):
     86         raise binascii.Error('Non-base64 digit found')
---> 87     return binascii.a2b_base64(s)
     88 
     89 

Error: Incorrect padding

The script works until it reaches the string 'MTA0MzI%3D', where it raises the error above.

Since the strings in the list come from URLs, I also tried unquoting with urllib.parse:

import base64
from urllib.parse import unquote
b64_string = 'MTA0MzI%3D'
b64_string = unquote(b64_string) # 'MTA0MzI=' 
b64_string += "=" * ((4 - len(b64_string) % 4) % 4)
print(base64.b64decode(b64_string).decode())

Output:

10432

Expected Output:

104327

The output may seem correct, but it isn't: unquoting changes the input text from 'MTA0MzI%3D' to 'MTA0MzI=', and the output changes from '104327' to '10432'. The thing is, the original text with the symbol decodes perfectly on this base64 site.

I have tried different Python versions (Python 2, 3.6, 3.8, etc.), tried the codecs module, and explored some other base64 functions, but with no success. Can someone please help me get this working or suggest another way to do it?

These are url-quoted strings, so url-unquoting is the correct procedure. The first step is to unquote them with urllib.parse.unquote. Only after that should you attempt base64-decoding, and there's no need to manually mess around with the base64 padding character =.

The website you reference ignores invalid base64 characters and also infers the padding from the length of the base64-encoded data. So you give the website MTA0MzI%3D and it throws away the % because it's not a valid base64 character, then processes MTA0MzI3D and returns 104327. Base64 padding is redundant, and I'm not sure why some base64 encoding standards require it, but many do.
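To make that behaviour concrete, here is a minimal sketch of a lenient decoder. It is only an assumption about what such a site might do, not its actual implementation, and lenient_b64decode is a hypothetical helper name. It drops every character outside the standard base64 alphabet, discards a single dangling character (which carries only 6 bits and cannot form a byte), and re-derives the padding from the length.

```python
import base64
import re


def lenient_b64decode(text):
    """Hypothetical helper: decode while ignoring non-base64 characters."""
    # Keep only characters from the standard base64 alphabet;
    # this drops '%' and also any '=' padding already present.
    cleaned = re.sub(r'[^A-Za-z0-9+/]', '', text)
    # One leftover character cannot encode a full byte, so discard it.
    if len(cleaned) % 4 == 1:
        cleaned = cleaned[:-1]
    # Infer the padding from the remaining length.
    cleaned += '=' * (-len(cleaned) % 4)
    return base64.b64decode(cleaned)


# '%' is dropped, leaving 'MTA0MzI3D'; the dangling 'D' is discarded,
# so 'MTA0MzI3' is what actually gets decoded.
site_like = lenient_b64decode('MTA0MzI%3D').decode()
print(site_like)  # 104327
```

The extra 7 comes from the 3 in %3D being read as base64 data, which is why the lenient result differs from the value that was originally encoded.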

Example:

import base64
import urllib.parse

# List of string which we are trying to decode
encoded_text_list = ['MTA0MDI0', 'MTA0MDYw', 'MTA0MDgz', 'MTA0MzI%3D']

# Iterating and decoding string using base64
for k in encoded_text_list:
    url_unquoted = urllib.parse.unquote(k)
    print(k, base64.b64decode(url_unquoted).decode('utf-8'))

Output:

MTA0MDI0 104024
MTA0MDYw 104060
MTA0MDgz 104083
MTA0MzI%3D 10432

and 10432 is the correct output, not 104327.

The problem is that % is not a valid base64 character. The decoder expects an = sign there and instead finds %3D, which happens to be the URL encoding of =. This likely means the value is URL-encoded somewhere upstream of your code. Depending on requirements, you have some options:

  1. Call k = urllib.parse.unquote(k) to undo the URL encoding; see the urllib.parse module in the standard library
  2. Call k = k.replace('%3D', '=') to clean up this one escape sequence
  3. Change the inputs so that they are not URL-encoded
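The first two options can be compared in a short sketch. The variable names are illustrative only, and the validate=True check at the end is an optional addition not mentioned above; it is one way to make a still-quoted input fail with a clearer error than "Incorrect padding".

```python
import base64
import binascii
from urllib.parse import unquote

raw = 'MTA0MzI%3D'

# Option 1: general URL-unquoting handles %3D and any other escape.
via_unquote = unquote(raw)

# Option 2: targeted replacement handles only this exact escape.
via_replace = raw.replace('%3D', '=')

# Both yield the same valid base64 string for this input.
decoded = base64.b64decode(via_unquote).decode('utf-8')
print(via_unquote, via_replace, decoded)

# With validate=True, characters outside the base64 alphabet (such as '%')
# raise binascii.Error instead of being silently discarded.
try:
    base64.b64decode(raw, validate=True)
    rejected = False
except binascii.Error:
    rejected = True
print(rejected)
```

Option 1 is usually the safer choice: the replace call misses a lowercase %3d and other escapes such as %2B or %2F, which unquote handles.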

Yes @President James K. Polk, I think you are right. I tried the reverse logic and it worked for me. We do need to unquote the text first since it is URL-based text. Then it works perfectly fine.

  1. First, we unquote and decode the text.
import base64
from urllib.parse import unquote

# Input string to decode
url_b64_string = 'MTA0MzI%3D'

# Unquoting since it is url generated string
b64_string = unquote(url_b64_string) # 'MTA0MzI='

# (Optional)
# b64_string += "=" * ((4 - len(b64_string) % 4) % 4)

# Decode using base64
decode_text = base64.b64decode(b64_string).decode() # '10432'
print(decode_text)

Output:

10432

  2. Then, we encode and quote the text to verify the round trip.
# Verifying output by encoding back

from urllib.parse import quote

# Input string to encode
decode_text = '10432'

# Encoding using base64 
encode_text = base64.b64encode(decode_text.encode()) # b'MTA0MzI=' or 'MTA0MzI=' using encode_text.decode()

# Quoting since it is url generated string
print(quote(encode_text)) # 'MTA0MzI%3D'

Output:

MTA0MzI%3D
