Encoding a hash in UTF-8

Question

I want to substitute a substring with a hash. The substring contains non-ASCII characters, so I tried to encode it to UTF-8.

result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)', lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4).encode()).hexdigest(), line.encode('utf-8'))

I am not really sure why this doesn't work. I thought that with line.encode('utf-8') the whole string would get encoded. I also tried to encode my m.group values to UTF-8, but I got the same UnicodeDecodeError.

[unicodedecodeerror: 'ascii' codec can't decode byte in position ordinal not in range(128)]

Sample input:

Start: myUsername: myÜsername:

What am I missing?

EDIT

Traceback (most recent call last):
  File "C:/Users/Peter/Desktop/coding/filter.py", line 26, in <module>
    encodeline = line.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 112: ordinal not in range(128)

Answer 1

Based on your symptoms, you're running on Python 2. Calling encode on a Python 2 str is almost always nonsensical.

You have two problems: one you're hitting now, and one you'll hit once you fix your current code.

Your first problem is that line is already a str holding (apparently) UTF-8 encoded bytes, not a unicode, so encoding it implicitly decodes with Python's default encoding (ASCII; this isn't locale specific to my knowledge, and it's a rare Python 2 install that uses anything else), then re-encodes with the specified codec (or the default if not specified). Basically, line was already UTF-8 encoded and you told it to encode again as UTF-8. That's nonsensical, so Python tried to decode as ASCII first, and failed before it even tried to encode as you instructed.

The solution to this problem is to not encode line at all; it's already UTF-8 encoded, so you're already golden.
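To see the implicit step in isolation, here is a minimal sketch in Python 3 syntax that mimics what Python 2 does when encode is called on a str. The helper name py2_style_encode and its default parameter are made up for illustration; they are not part of any library.

```python
# Python 3 sketch; py2_style_encode is a hypothetical helper, not a real API.
# In Python 2, str.encode(codec) first decodes the bytes with the default
# codec (ASCII on most installs) and only then encodes with `codec`.

def py2_style_encode(data, codec="utf-8", default="ascii"):
    """Mimic Python 2's str.encode: implicit decode, then encode."""
    return data.decode(default).encode(codec)

# Stand-in for a Python 2 str that already holds UTF-8 encoded bytes.
line = "Start: myUsername: myÜsername:".encode("utf-8")

try:
    py2_style_encode(line)
    error = None
except UnicodeDecodeError as exc:
    # The failure comes from the hidden ASCII decode, not from the encode.
    error = exc

print(type(error).__name__, error)
```

The exception is raised on the first byte of the encoded Ü, which matches the 0xc3 byte reported in the traceback from the question.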

Your second problem (which you haven't encountered yet, but you will) is that you're calling encode on the group(4) result. Since the input was a str, the group is a str too, and you'll hit the same problem trying to encode a str: because the group came from raw UTF-8 encoded bytes, its non-ASCII parts cause a UnicodeDecodeError during the implicit decode step before the encode.

The reason:

import sys

reload(sys)
sys.setdefaultencoding('UTF8')

works is that it (dangerously) changes the implicit decode step to use UTF-8, so all your encode calls now perform the implicit decode with UTF-8 instead of ASCII. The decode and encode is mostly pointless: all it does is return the original str after confirming it's legal UTF-8 by decoding it as such, otherwise acting as an expensive no-op.
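A rough illustration of that no-op, again as a Python 3 sketch rather than real Python 2 code: with UTF-8 as the implicit codec, the decode and encode round trip either hands back the same bytes or raises on invalid input. The function name roundtrip_utf8 is an assumption for this example.

```python
def roundtrip_utf8(data):
    """Decode as UTF-8, then encode as UTF-8 again (validation only)."""
    return data.decode("utf-8").encode("utf-8")

valid = "myÜsername".encode("utf-8")
unchanged = roundtrip_utf8(valid) == valid  # the same bytes come back

invalid = b"my\xdcsername"  # Latin-1 encoded Ü, which is not legal UTF-8
try:
    roundtrip_utf8(invalid)
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(unchanged, rejected)
```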

To fix the second problem, just change:

m.group(4).encode()

to:

m.group(4)

That leaves your final code as:

result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)',
                lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4)).hexdigest(),
                line)
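For readers on Python 3, the same idea can be sketched with bytes throughout. This is an assumed port, not the code from the question: the pattern becomes a bytes pattern, and the hex digest (a text string in Python 3) is encoded back to bytes so it can be concatenated with the other groups. The function name hash_last_field is hypothetical.

```python
import hashlib
import re

# Bytes pattern, so it can be applied directly to UTF-8 encoded input.
PATTERN = re.compile(rb'(Start:\s*)([^:]+)(:\s*)([^:]+)')

def hash_last_field(line):
    """Replace the fourth group of each match with its SHA-512 hex digest."""
    def repl(m):
        # hashlib takes the raw bytes as-is; no encode call is needed.
        digest = hashlib.sha512(m.group(4)).hexdigest().encode("ascii")
        return m.group(1) + m.group(2) + m.group(3) + digest
    return PATTERN.sub(repl, line)

line = "Start: myUsername: myÜsername:".encode("utf-8")
result = hash_last_field(line)
print(result.decode("utf-8"))
```

Keeping the data as bytes from input to output avoids any decode or encode step around the hash.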

Optionally, if you want to confirm your expectation that line is in fact UTF-8 encoded bytes already, add the following above that re.sub line:

import sys  # needed for sys.exit below

try:
    line.decode('utf-8')  # raises if line is not legal UTF-8
except Exception as e:
    sys.exit("line (of type {!r}) not decodable as UTF-8: {}".format(line.__class__.__name__, e))

which will cause the program to exit immediately if the data given is not legal UTF-8. It will also tell you what type line is, so you can confirm whether it's really str or unicode: str implies you chose the wrong codec, while unicode means your inputs aren't of the expected type.

Answer 2

I found what is, in my eyes, a workaround. It doesn't feel right, but it does the job.

import sys

reload(sys)
sys.setdefaultencoding('UTF8')

I thought it could be done with .encode('utf-8').

Answer 3

import hashlib

file = 'xyz'
res = hashlib.sha224(str(file).encode('utf-8')).hexdigest()

This is needed because a unicode object must be encoded to a byte string before it is hashed.
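In Python 3 terms (an assumption, since the question itself is about Python 2), hashlib accepts only bytes, so a text string has to be encoded first. A short sketch:

```python
import hashlib

text = "myÜsername"

try:
    hashlib.sha224(text)  # passing str instead of bytes
    needs_bytes = False
except TypeError:
    # Python 3 refuses to hash text that has not been encoded.
    needs_bytes = True

digest = hashlib.sha224(text.encode("utf-8")).hexdigest()
print(needs_bytes, len(digest))
```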
