
Why can't rstrip return the raw text in Python?

I am trying to print a Spanish text line by line using the following Python code:

path = 'segismundo.txt'   #set the path file
f = open(path, encoding="utf-8")
lines = [x.rstrip() for x in open(path)]
print(lines)

The raw text is:

Sueña el rico en su riqueza,
que más cuidados le ofrece;

sueña el pobre que padece
su miseria y su pobreza;

However, the result is:

['Sue帽a el rico en su riqueza,', 'que m谩s cuidados le ofrece;', '', 'sue帽a el pobre que padece', 'su miseria y su pobreza;', '']

My system language is Chinese (the odd characters '帽' and '谩' are Chinese characters), so I am wondering whether this happens because the rstrip method only works on English text?

Encoding and decoding is a finicky subject, especially because current software has to try to maintain compatibility with pre-Unicode software and files.

So the text you list is not raw, in the sense that it is not what is stored in the file. Files in most file systems contain bytes, and you have to learn the encoding used for them in some other way. To help with that, Python by default picks the encoding used for opening files based on your locale settings. You can override that with the encoding argument to open, as you did on the line starting with f = ..., but crucially not on the next line, where you open the same file again with the default encoding.
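To make the decoding mismatch concrete, here is a minimal sketch. It assumes your system default is GBK (often reported as cp936 on Chinese-language Windows); that is a guess from the characters in your output, not something the question states.

```python
# The bytes a UTF-8 encoded file stores for the first word of the poem.
raw_bytes = "Sueña".encode("utf-8")
print(raw_bytes)

# Decoding with the encoding the file was written in restores the text.
decoded_correctly = raw_bytes.decode("utf-8")

# Decoding the same bytes as GBK (the assumed system default) pairs the
# two bytes of "ñ" into a single Chinese character instead.
decoded_as_gbk = raw_bytes.decode("gbk")

print(decoded_correctly == "Sueña")
print(decoded_as_gbk == "Sue帽a")
```

Note that rstrip is not involved at all: the text is already mangled by the time it reaches the list comprehension.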

print has a similar issue: its output can be written to a file, printed on a terminal, or piped to another process, but crucially all of those operate on sequences of raw bytes, and thus strings need to be encoded.
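As a rough illustration of that encoding step, the sketch below sends print to an in-memory byte buffer through io.TextIOWrapper, once per encoding. The helper name encoded_output is made up for this example, and latin-1 is just an arbitrary second encoding for comparison.

```python
import io

def encoded_output(text, encoding):
    """Return the bytes print() produces on a stream that uses `encoding`."""
    buffer = io.BytesIO()
    # newline="\n" disables line-ending translation so the bytes are
    # the same on every platform.
    stream = io.TextIOWrapper(buffer, encoding=encoding, newline="\n")
    print(text, file=stream)
    stream.flush()
    data = buffer.getvalue()
    stream.detach()  # keep the wrapper from closing the buffer later
    return data

utf8_bytes = encoded_output("más", "utf-8")
latin1_bytes = encoded_output("más", "latin-1")
print(utf8_bytes)    # "á" becomes two bytes
print(latin1_bytes)  # "á" becomes one byte
```

The same string ends up as different bytes, so whatever reads those bytes has to assume the same encoding to show the original characters.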

So there are two potential mismatches in your code:

  1. The file is encoded with UTF-8 but gets decoded using your system default, which may not be UTF-8.
  2. The output gets encoded with your system default encoding, but your terminal assumes it is some other encoding.
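If you want to check which of the two applies on your machine, a small diagnostic along these lines may help. It only reports what Python has picked; the variable names are illustrative.

```python
import locale
import sys

# Encoding open() falls back to when no encoding argument is given.
file_default = locale.getpreferredencoding(False)

# Encoding print() uses when writing to standard output; a replaced
# stdout object may not define it, hence the getattr fallback.
stdout_encoding = getattr(sys.stdout, "encoding", None)

print("open() default:", file_default)
print("stdout encoding:", stdout_encoding)
```

If the first value is not UTF-8, the first mismatch is the likely culprit for a UTF-8 file.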

Given the clues in your question, my guess is that you simply need to change the line where you read the text to:

lines = [x.rstrip() for x in f]

You also never close the file. That is usually not an issue, but it is something to keep in mind for larger applications: you don't want to keep files open when you don't have to.
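One common way to handle both points at once is a with block, which closes the file for you. The sketch below writes a two-line stand-in for segismundo.txt to a temporary directory so that it runs on its own; in your script you would only need the second with block.

```python
import os
import tempfile

poem = "Sueña el rico en su riqueza,\nque más cuidados le ofrece;\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for the real file, written explicitly as UTF-8.
    path = os.path.join(tmp_dir, "segismundo.txt")
    with open(path, "w", encoding="utf-8") as out:
        out.write(poem)

    # Read with the same explicit encoding; the file is closed
    # automatically when the with block ends.
    with open(path, encoding="utf-8") as f:
        lines = [x.rstrip() for x in f]

print(len(lines))
print(f.closed)
```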
