encode hash in utf-8
I want to substitute a substring with a hash. The substring contains non-ASCII characters, so I tried to encode it to UTF-8.
result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)', lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4).encode()).hexdigest(), line.encode('utf-8'))
I am not really sure why this doesn't work; I thought that with line.encode('utf-8') the whole string gets encoded. I also tried to encode my m.group results to UTF-8, but I got the same UnicodeDecodeError.
[unicodedecodeerror: 'ascii' codec can't decode byte in position ordinal not in range(128)]
Sample input:
Start: myUsername: myÜsername:
What am I missing?
EDIT
Traceback (most recent call last):
File "C:/Users/Peter/Desktop/coding/filter.py", line 26, in <module>
encodeline = line.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 112: ordinal not in range(128)
Based on your symptoms, you're running on Python 2. Calling encode on a Python 2 str is almost always nonsensical.
You have two problems: one you're hitting now, and one you'll hit once you fix your current code.
Your first problem is that line is already a str holding (apparently) UTF-8 encoded bytes, not a unicode, so encoding it implicitly decodes it with Python's default encoding (ASCII; this isn't locale specific to my knowledge, and it's a rare Python 2 install that uses anything else), then re-encodes with the specified codec (or the default if none is specified). Basically, line was already UTF-8 encoded and you told it to encode again as UTF-8, which is nonsensical, so Python tried to decode it as ASCII first, and failed before it even tried to encode as you instructed.
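The implicit decode step described above can be spelled out by hand. This is only a sketch: the helper name encode_like_python2_str and the sample bytes are made up for illustration, and the two steps are written explicitly so the snippet also runs on Python 3, where bytes has no encode method.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for `line`: UTF-8 encoded bytes, as read from a file.
raw_line = u"Start: myUsername: my\u00dcsername:".encode("utf-8")

def encode_like_python2_str(data, codec="utf-8"):
    """Mimic what Python 2 does for str.encode(codec) (illustrative helper)."""
    text = data.decode("ascii")   # implicit step: fails on any byte >= 0x80
    return text.encode(codec)     # the step that was actually requested

try:
    encode_like_python2_str(raw_line)
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The failure comes from the ASCII decode, never from the UTF-8 encode.
print(error)
```

The exception reports the 'ascii' codec and the byte 0xc3, the first byte of the UTF-8 encoding of Ü, which matches the traceback in the question.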
The solution to this problem is to just not encode line at all; it's already UTF-8 encoded, so you're already golden.
Your second problem (which you haven't encountered yet, but you will) is that you're calling encode on the group(4) result. But of course, since the input was a str, the group is a str too, and you'll encounter the same problem trying to encode a str; since the group came from raw UTF-8 encoded bytes, the non-ASCII parts of it cause a UnicodeDecodeError during the implicit decode step before the encode.
The reason:
import sys
reload(sys)
sys.setdefaultencoding('UTF8')
works is that it (dangerously) changes the implicit decode step to use UTF-8, so all your encode calls now perform the implicit decode with UTF-8 instead of ASCII. The decode and encode pair is mostly pointless: all it does is confirm the str is legal UTF-8 by decoding it as such and then return the original str, so it otherwise acts as an expensive no-op.
To fix the second problem, just change:
m.group(4).encode()
to:
m.group(4)
That leaves your final code as:
result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)',
                lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4)).hexdigest(),
                line)
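For anyone who wants to try this outside the original script, here is a self-contained sketch of the same substitution working on bytes throughout. The sample value of line and the names PATTERN and hash_last_field are assumptions for illustration; the bytes pattern and the extra .encode("ascii") on the digest are only there so the snippet behaves the same on Python 2 and Python 3.

```python
import hashlib
import re

# Hypothetical sample: `line` as UTF-8 encoded bytes (what a Python 2 str holds).
line = u"Start: myUsername: my\u00dcsername:".encode("utf-8")

# A bytes pattern, so it matches bytes input on Python 2 and Python 3 alike.
PATTERN = re.compile(br'(Start:\s*)([^:]+)(:\s*)([^:]+)')

def hash_last_field(m):
    # Hash the raw bytes of group 4 directly; no encode/decode is needed.
    # hexdigest() returns text on Python 3, so turn it back into bytes.
    digest = hashlib.sha512(m.group(4)).hexdigest().encode("ascii")
    return m.group(1) + m.group(2) + m.group(3) + digest

result = PATTERN.sub(hash_last_field, line)
print(result)
```

Only the fourth group is replaced by its SHA-512 hex digest; the prefix and the trailing colon of the sample input are left untouched.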
Optionally, if you want to confirm your expectation that line is in fact UTF-8 encoded bytes already, add the following above that re.sub line:
try:
    line.decode('utf-8')
except Exception as e:
    sys.exit("line (of type {!r}) not decodable as UTF-8: {}".format(line.__class__.__name__, e))
which will cause the program to exit immediately if the data given is not legal UTF-8 (and will also let you know what type line is, so you can confirm for sure whether it's really str or unicode, since str implies you chose the wrong codec, while unicode means your inputs aren't of the expected type).
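If exiting the program is too drastic, the same check can be wrapped in a small predicate. This is a sketch under assumptions: is_valid_utf8 is a hypothetical helper name, and the two sample byte strings are made up to show one passing and one failing input.

```python
def is_valid_utf8(data):
    """Return True if `data` (a byte string) decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

good = u"my\u00dcsername".encode("utf-8")    # valid UTF-8 bytes
bad = u"my\u00dcsername".encode("latin-1")   # a lone 0xDC byte is not valid UTF-8

print(is_valid_utf8(good), is_valid_utf8(bad))
```

Returning a boolean leaves it to the caller to decide whether to skip the line, log it, or abort.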
I found what is, in my eyes, a workaround. It doesn't feel right, but it does the job.
import sys
reload(sys)
sys.setdefaultencoding('UTF8')
I thought it could be done with .encode('utf-8'):
file = 'xyz'
res = hashlib.sha224(str(file).encode('utf-8')).hexdigest()
This is because a unicode object must be encoded to a byte string before it is hashed.
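Note that this requirement is enforced on Python 3, where hashlib refuses text input outright. The sketch below assumes Python 3; the variable names are illustrative only.

```python
import hashlib

name = u"xyz"   # text: unicode in Python 2, str in Python 3

try:
    hashlib.sha224(name)          # Python 3 rejects text input with a TypeError
    needs_encoding = False
except TypeError:
    needs_encoding = True

# Encoding first turns the text into bytes, which hashlib accepts.
digest = hashlib.sha224(name.encode("utf-8")).hexdigest()
print(needs_encoding, digest)
```

This applies to text objects only; a Python 2 str that already holds UTF-8 bytes, as in the question, should be hashed as-is.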