
Encode hash in UTF-8

I want to substitute a substring with a hash. The substring contains non-ASCII characters, so I tried to encode it to UTF-8.

result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)', lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4).encode()).hexdigest(), line.encode('utf-8'))

I am not really sure why this doesn't work. I thought that with line.encode('utf-8') the whole string would get encoded. I also tried to encode my m.group results to UTF-8, but I got the same UnicodeDecodeError.

[UnicodeDecodeError: 'ascii' codec can't decode byte in position ordinal not in range(128)]

Sample input:

Start: myUsername: myÜsername:

What am I missing?

EDIT

Traceback (most recent call last):
  File "C:/Users/Peter/Desktop/coding/filter.py", line 26, in <module>
    encodeline = line.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 112: ordinal not in range(128)

Based on your symptoms, you're running on Python 2. Calling encode on a Python 2 str is almost always nonsensical.

You have two problems: one you're hitting now, and one you'll hit once you fix your current code.

Your first problem is that line is already a str holding (apparently) UTF-8 encoded bytes, not a unicode object, so encoding it implicitly decodes it with Python's default encoding (ASCII; to my knowledge this isn't locale specific, and it's a rare Python 2 install that uses anything else), then re-encodes with the specified codec (or the default if none is specified). Basically, line was already UTF-8 encoded and you told Python to encode it again as UTF-8. That's nonsensical, so Python tried to decode it as ASCII first, and failed before it even tried to encode as you instructed.
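The implicit step can be made visible with a small sketch. It is written so that it also runs on Python 3, where the ASCII decode has to be spelled out by hand; the variable names are illustrative, and the sample line is the one from the question.

```python
# Sketch: reproduce by hand what Python 2 effectively does for
# str.encode('utf-8') when the default encoding is ASCII.
# `raw` stands in for `line` as read from the file (UTF-8 encoded bytes).
raw = u"Start: myUsername: my\u00dcsername:".encode("utf-8")

err = None
try:
    # Implicit decode with the default codec, then the requested encode.
    raw.decode("ascii").encode("utf-8")
except UnicodeDecodeError as exc:
    err = exc

# The decode step fails on the first byte of the encoded U-umlaut (0xc3),
# so the encode step is never reached.
print(err)
```

The exception is a UnicodeDecodeError even though the call being made is encode, which is exactly what the traceback in the question shows.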

The solution to this problem is to just not encode line at all; it's already UTF-8 encoded, so you're already golden.

Your second problem (which you haven't encountered yet, but you will) is that you're calling encode on the group(4) result. Since the input was a str, the group is a str too, and you'll hit the same problem trying to encode a str: because the group came from raw UTF-8 encoded bytes, its non-ASCII parts cause a UnicodeDecodeError during the implicit decode step before the encode.

The reason:

import sys

reload(sys)
sys.setdefaultencoding('UTF8')

works is that it (dangerously) changes the implicit decode step to use UTF-8, so all your encode calls now perform the implicit decode with UTF-8 instead of ASCII. The decode and encode are mostly pointless: all they do is return the original str after confirming it's legal UTF-8 by decoding it as such, otherwise acting as an expensive no-op.
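The no-op can be checked directly. This is a minimal sketch, with the decode written out explicitly so it behaves the same on Python 2 and Python 3; the sample value is a stand-in for m.group(4).

```python
# Sketch: with UTF-8 as the implicit codec, encoding already-encoded bytes
# amounts to a decode followed by an encode with the same codec.
raw = u"my\u00dcsername".encode("utf-8")   # stand-in for m.group(4)

round_tripped = raw.decode("utf-8").encode("utf-8")

# Valid UTF-8 comes back byte for byte, so nothing was gained.
print(round_tripped == raw)
```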

To fix the second problem, just change:

m.group(4).encode()

to:

m.group(4)

That leaves your final code as:

result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)',
                lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4)).hexdigest(),
                line)
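For anyone who wants to try this outside the original script, here is a self-contained sketch of the same substitution. It treats the line as bytes throughout so that it runs on both Python 2 and Python 3; the helper name hash_last_field and the compiled PATTERN are hypothetical, and the bytes pattern plus the ASCII encode of the digest are additions needed only for Python 3.

```python
import hashlib
import re

# The pattern from the question, written as a bytes pattern so it can be
# applied directly to UTF-8 encoded input.
PATTERN = re.compile(br'(Start:\s*)([^:]+)(:\s*)([^:]+)')

def hash_last_field(line):
    """Replace group 4 of a UTF-8 encoded line with its SHA-512 hex digest."""
    def repl(m):
        # m.group(4) is already bytes, so it is hashed as-is, no encode().
        digest = hashlib.sha512(m.group(4)).hexdigest()
        # hexdigest() is pure ASCII; encode it so every piece is bytes.
        return m.group(1) + m.group(2) + m.group(3) + digest.encode("ascii")
    return PATTERN.sub(repl, line)

line = u"Start: myUsername: my\u00dcsername:".encode("utf-8")
result = hash_last_field(line)
print(result)
```

The hash is computed over the UTF-8 bytes of the field, so the same input always yields the same digest regardless of the interpreter's default encoding.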

Optionally, if you want to confirm your expectation that line is in fact UTF-8 encoded bytes already, add the following above that re.sub line:

import sys  # for sys.exit; skip if already imported at the top of the file

try:
    line.decode('utf-8')
except Exception as e:
    sys.exit("line (of type {!r}) not decodable as UTF-8: {}".format(line.__class__.__name__, e))

which will cause the program to exit immediately if the data given is not legal UTF-8 (and will also let you know what type line is, so you can confirm for sure whether it's really str or unicode, since str implies you chose the wrong codec, while unicode means your inputs aren't of the expected type).
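If exiting the program is too drastic, the same check can be wrapped in a function that only reports the result. This is a sketch under that assumption; is_utf8 is a hypothetical name, and the Latin-1 sample is just one way to produce bytes that are not valid UTF-8.

```python
def is_utf8(data):
    """Return True if `data` (a byte string) decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

good = u"my\u00dcsername".encode("utf-8")
# In Latin-1 the U-umlaut is the single byte 0xdc, which is not valid UTF-8
# when followed by a plain ASCII byte.
bad = u"my\u00dcsername".encode("latin-1")

print(is_utf8(good))
print(is_utf8(bad))
```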

I found what is, in my eyes, a workaround. It doesn't feel right, but it does the job.

import sys

reload(sys)
sys.setdefaultencoding('UTF8')

I thought it could be done with .encode('utf-8'):

import hashlib

file = 'xyz'
res = hashlib.sha224(str(file).encode('utf-8')).hexdigest()

This is because a unicode object must be encoded to a byte string before hashing.
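A short sketch of that rule, assuming the text is held as a unicode object (the default string type on Python 3). The sample value is illustrative.

```python
import hashlib

text = u"my\u00dcsername"

try:
    # Python 3 raises TypeError for text objects; Python 2 fails for
    # non-ASCII unicode with UnicodeEncodeError during its implicit encode.
    hashlib.sha224(text)
    rejected = False
except (TypeError, UnicodeEncodeError):
    rejected = True

# Encoding first works on both versions, because hashlib operates on bytes.
res = hashlib.sha224(text.encode("utf-8")).hexdigest()
print(rejected)
print(res)
```

Note that this applies to unicode objects only. A Python 2 str that already holds UTF-8 bytes, as in the question, should be passed to hashlib unchanged.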

