Read file as a string literal in Python

I am working with a ctypes function that requires a byte string, which is read in from a file. If I put the string into the script, it works, as long as I write it as a raw string literal (i.e. with the 'r' prefix) and then convert it to a byte string. But if I just read it in as a byte string, it does not work. Is there a way to read in a file as a string literal?

if __name__ == '__main__':
    a = r"\x00hello"
    with open('some_file', 'rb') as f: # some file contains only "\x00hello"
        b = f.read()
    c = b"\x00hello"

    x = CtypeObj.Function(a.encode('utf-8', errors='ignore')) # success!
    y = CtypeObj.Function(b)                                  # failure!
    z = CtypeObj.Function(c)                                  # failure!

The line that you point to as a success likely isn't doing what you think it is doing either:

a = r"\x00hello"

That defines a string of 9 characters: \ , x , 0 , 0 , h , and so on. Calling a.encode('utf-8', errors='ignore') takes that string, encodes its characters as UTF-8, and returns a bytes value of that encoding (which CtypeObj.Function() accepts).
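A quick way to check this is to inspect the lengths and bytes directly. The following is a minimal sketch using only built-ins; the variable names are illustrative:

```python
raw = r"\x00hello"             # raw string literal: the backslash is kept as-is
encoded = raw.encode("utf-8")  # one byte per character here, since all are ASCII

print(len(raw))                  # number of characters in the string
print(encoded)                   # the bytes produced by encoding
print(encoded[0] == ord("\\"))   # the first byte is a backslash, not a NUL byte
```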

I would assume that you don't really want that \x00 part passed to the function?

Reading from a file opened in 'rb' mode gets you a bytes value as well, but the encoding of that bytes value is whatever encoding the file was written in. If you need it to be UTF-8 encoded (and the file might not be), then you should instead open the file in 'r' mode, read the value as a string, and encode it with b.encode('utf-8').

And finally this line:

c = b"\x00hello"

This just creates a bytes value of length 6, with the first byte being the 0 byte and the rest being the values for the 5 letters. However, that's not automatically a UTF-8 encoding, and certainly not the same as what you had before. Again, it would seem you don't want that \x00 at the start, since it's very unusual for a string to start with a null character like that.
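One possible reason a leading null byte causes trouble is that C code usually treats a byte string as NUL-terminated, so the string appears empty. Whether CtypeObj.Function() does this is an assumption, since its signature is not shown. A small sketch of how ctypes itself views such a value:

```python
import ctypes

c = b"\x00hello"
print(len(c))  # Python sees all 6 bytes
print(c[0])    # the first byte is 0

# Assumption: the C side reads the argument as a NUL-terminated string.
# c_char_p.value stops at the first NUL byte, so it comes back empty.
as_c_string = ctypes.c_char_p(c)
print(as_c_string.value)

# A character buffer keeps every byte (plus a trailing NUL terminator);
# .raw exposes the full contents, while .value again stops at the first NUL.
buf = ctypes.create_string_buffer(c)
print(buf.raw)
print(buf.value)
```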

As indicated in the comments, r"\x00hello" and 'hello' are both string literals, but that's only meaningful in the context of code. In terms of data, you only have strings of characters ( str ) and bytes values (sometimes called byte strings). A "literal" is a way to write either directly in code:

s = 'hello'   # a string literal
b = b'hello'  # a bytes literal for the same text (under most encodings)

s == b.decode()  # True
b == s.encode()  # True

If you read a file using mode 'r', you get strings. If you read a file using mode 'rb', you get bytes.
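The difference between the two modes can be seen with a throwaway file. This sketch writes the nine ASCII characters \x00hello to a temporary file (a stand-in for the question's 'some_file') and reads it back both ways:

```python
import os
import tempfile

# Create a temporary file holding the literal text \x00hello (9 ASCII bytes).
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"\\x00hello")
    path = tmp.name

with open(path, "r", encoding="utf-8") as f:
    text = f.read()   # text mode: a str
with open(path, "rb") as f:
    data = f.read()   # binary mode: a bytes value
os.remove(path)

print(type(text), len(text))
print(type(data), len(data))
print(text.encode("utf-8") == data)  # same content, different types
```

In both cases the backslash, x, 0 and 0 arrive as four separate characters or bytes; neither mode interprets escape sequences.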

Try this:

if __name__ == '__main__':
    with open('./file.txt', 'rb') as f:
        # read `\x00hello` from file, remove trailing newline
        line = f.read().rstrip()
        # decode the unicode escapes, then re-encode
        line = line.decode('unicode-escape').encode('utf-8')

    print(line)
    print(b'\x00hello')

    print(line == b'\x00hello')

Adapted from advice from this answer.


[~] $ cat file.txt
\x00hello
[~] $ python script.py
b'\x00hello'
b'\x00hello'
True
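One caveat with the decode/re-encode step: re-encoding as UTF-8 reproduces the intended bytes only while every escape is below \x80. If the file may contain escapes such as \xff and the goal is one byte per \xNN escape (an assumption about what the C function expects), encoding with 'latin-1' is a possible alternative. A minimal sketch:

```python
escaped = b"\\xffhello"  # the 9 bytes as they would sit in the file: \xffhello

text = escaped.decode("unicode_escape")  # interprets \xff as code point U+00FF
as_utf8 = text.encode("utf-8")      # U+00FF becomes the two bytes C3 BF
as_latin1 = text.encode("latin-1")  # U+00FF becomes the single byte FF

print(as_utf8)
print(as_latin1)
print(len(as_utf8), len(as_latin1))
```

Note that 'latin-1' raises UnicodeEncodeError if the decoded text contains code points above U+00FF, for example from \u escapes.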
