Python3 str(), bytes, and unicode

I'm having trouble with the types involved in a piece of code I wrote. Ideally I wouldn't pay any mind to encodings, but sometimes you're forced to.

This is all centered around a directory walk of an NTFS file system on Windows. Certain characters in file names (Unicode, it seems) couldn't be written out to files or printed to the standard Windows terminal. (Yes, I tried "chcp 65001" to print, which didn't work, but I need to write to a standard plain text file anyway.)

So I did the following. As I understand it, Python 3 (I'm using 3.2.2) is Unicode throughout, so str() objects (and all supporting libs) are Unicode, so I did this:

absfilepath = os.path.join(root, file).encode()

thinking a UTF-8 string would be returned and all would be good with the world, but then I got errors about implicit type conversions to str() when I went to write to a file or to stdout. So I did the following:

hashmap[checksum] = str(absfilepath)

(the hashmap is dumped later).

thinking it's now a native Unicode Python 3 string... but when I dump it into a file, using this:

for key, val in m.items():
    f.write(key + "|" + val + "\n")

I still get this in the file:

e77bceb64d179377731a94186e56281c|b'K:\Filename'

which indicates a byte array.

So what am I doing wrong here? I'm sorry 'non-traditional' characters are in this directory tree; I'd rather they weren't there, but they are. How do I store them (convert them?) in a form that can be printed/written as normal plain text (ASCII?), and why is a byte array being returned from my hashmap when I'm clearly storing a standard string? Dealing with Unicode has been a pretty horrific experience for me.

absfilepath = os.path.join(root, file).encode()

os.path.join() returns a string, and str.encode() converts the string to a bytes object, so absfilepath contains a bytes object.

hashmap[checksum] = str(absfilepath)

When you call str() on a bytes object, the bytes object is not decoded; instead, a string representation of it is created:

>>> str(b'K:\Filename')
"b'K:\\\\Filename'"
>>> str(b'K:\Filename') == repr(b'K:\Filename')
True

So your dictionary now contains lots of "b'some-bytes-string'" strings.
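To make the difference concrete, here is a small sketch contrasting str() on a bytes object with an actual decode. The path value is made up for illustration. The two-argument form str(data, "utf-8") behaves like data.decode("utf-8"); the one-argument form does not decode at all.

```python
# Hypothetical path containing one non-ASCII character, for illustration only.
path = "K:\\Fil\u00e9name"

encoded = path.encode("utf-8")        # str -> bytes

as_repr = str(encoded)                # NOT a decode: same text as repr(encoded)
decoded = encoded.decode("utf-8")     # bytes -> str, the inverse of encode()
also_decoded = str(encoded, "utf-8")  # str() decodes only when given an encoding

print(as_repr)          # the b'...' representation, stored as ordinary text
print(decoded == path)  # the round trip recovers the original string
```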

The "fix" is simple: just don't encode the strings you get from os.path.join.


If you get errors while writing the strings out to the file, consider specifying an explicit encoding when opening the file in text mode:

with open('some_file', 'w', encoding='utf-8') as f:
    …

Then Python will encode the strings as UTF-8 automatically as they are written.
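As a rough illustration of the text-mode approach, the sketch below writes one checksum|path line with an explicit encoding and reads it back. The dictionary contents and the temporary output location are assumptions made for the example, not part of the original code.

```python
import os
import tempfile

# Hypothetical data: checksum -> path, with a non-ASCII character in the path.
hashmap = {"e77bceb64d179377731a94186e56281c": "K:\\Fil\u00e9name"}

# A throwaway file so the sketch does not touch real data.
out_path = os.path.join(tempfile.mkdtemp(), "some_file")

# The file object encodes each str to UTF-8 on write; no manual .encode() needed.
with open(out_path, "w", encoding="utf-8") as f:
    for key, val in hashmap.items():
        f.write(key + "|" + val + "\n")

# Reading back with the same encoding recovers the original text.
with open(out_path, "r", encoding="utf-8") as f:
    lines = f.read().splitlines()
```

Without the encoding argument, open() falls back to a platform-dependent default, which is a likely source of the write errors on a Windows machine.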

Alternatively, to be completely safe, you can open the file in binary mode and write the encoded strings instead:

with open('some_file', 'wb') as f:
    for key, val in m.items():
        value = key + "|" + val + "\n"
        f.write(value.encode('utf-8'))  # write a bytes object

But as long as you stay within Python, you don't need to worry about special characters inside string objects. Python can handle them; it's just the output devices that typically fail (e.g. printing to the console).
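For the console case, and for the question's wish for plain ASCII output, one option is to escape whatever ASCII cannot represent. The sketch below uses the standard "backslashreplace" error handler; the helper name ascii_safe is made up for this example. Note that this changes the text (non-ASCII characters become escape sequences), so it suits logs and display rather than storing paths you need to reopen.

```python
def ascii_safe(text):
    """Return text with every non-ASCII character replaced by a backslash escape."""
    return text.encode("ascii", errors="backslashreplace").decode("ascii")


# Hypothetical file name with one non-ASCII character.
name = "Fil\u00e9name"
print(ascii_safe(name))  # prints Fil\xe9name
```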

You encoded your Unicode string:

absfilepath = os.path.join(root, file).encode()
#                                      ^^^^^^^^

This produces a bytestring. Either don't encode, or decode again when storing the paths in your hashmap:

hashmap[checksum] = absfilepath.decode()
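Putting the "don't encode" option together, a directory walk along these lines keeps every path as a str from start to finish. This is only a sketch: the question does not show how the checksum is computed, so MD5 via hashlib is an assumption here (suggested by the 32-character hex digest in the sample output), and build_hashmap is a hypothetical helper name.

```python
import hashlib
import os


def build_hashmap(top):
    """Map each file's MD5 hex digest to its path, keeping the path as a str."""
    hashmap = {}
    for root, _dirs, files in os.walk(top):
        for name in files:
            # os.path.join already returns a str; no .encode() is needed.
            absfilepath = os.path.join(root, name)
            # Assumption: the checksum is an MD5 of the file contents.
            with open(absfilepath, "rb") as fh:
                checksum = hashlib.md5(fh.read()).hexdigest()
            hashmap[checksum] = absfilepath
    return hashmap
```

Reading each file whole is fine for a sketch; for large files, feeding the hash in chunks would avoid loading everything into memory.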
