Don't convert newline when reading a file
I'm reading a text file:
f = open('data.txt')
data = f.read()
However, the newlines in the data variable are normalized to LF ('\n'), while the file contains CRLF ('\r\n').
How can I instruct Python to read the file as is?
In Python 2.x:
f = open('data.txt', 'rb')
The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability. (Appending 'b' is useful even on systems that don't treat binary and text files differently, where it serves as documentation.)
In Python 3.x, there are three alternatives:
f1 = open('data.txt', 'rb')
This will leave newlines untransformed, but will also return bytes instead of str, which you will have to explicitly decode to Unicode yourself. (Of course the 2.x version also returned bytes that had to be decoded manually if you wanted Unicode, but in 2.x that's what a str object is; in 3.x str is Unicode.)
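A minimal sketch of the binary-mode route in Python 3. The helper name read_raw, the temporary file, and the UTF-8 encoding are assumptions made for illustration; the question does not say how data.txt is encoded.

```python
import os
import tempfile


def read_raw(path, encoding="utf-8"):
    """Read a file in binary mode, then decode it without touching newlines."""
    with open(path, "rb") as f:
        raw = f.read()           # bytes, exactly as stored on disk
    return raw.decode(encoding)  # str, with any CRLF pairs still intact


# Build a small CRLF file to try the helper on.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"first\r\nsecond\r\n")

text = read_raw(path)
os.remove(path)
print(repr(text))
```

If the real encoding differs from the assumed one, decode may raise UnicodeDecodeError, so pass the encoding you actually expect.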
f2 = open('data.txt', 'r', newline='')
This will return str, and leave newlines untranslated. Unlike the 2.x equivalent, however, readline and friends will treat '\r\n' as a newline, instead of a regular character followed by a newline. Usually this won't matter, but if it does, keep it in mind.
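The difference in line splitting can be seen with a small experiment. This is a sketch run under Python 3 only; binary mode stands in for the 2.x behaviour here, and the sample content and ASCII encoding are assumptions.

```python
import os
import tempfile

# Sample file mixing all three line-ending styles.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"one\r\ntwo\rthree\n")

# Text mode with newline='': '\r\n', '\r' and '\n' all end a line,
# and each ending is handed back unchanged.
with open(path, "r", newline="", encoding="ascii") as f:
    text_lines = f.readlines()

# Binary mode: only b'\n' ends a line, so the lone '\r' does not split.
with open(path, "rb") as f:
    byte_lines = f.readlines()

os.remove(path)
print(text_lines)
print(byte_lines)
```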
f3 = open('data.txt', 'rb', encoding=locale.getpreferredencoding(False))
This treats newlines exactly the same way as the 2.x code, and returns str using the same encoding you'd get if you just used all of the defaults... but it's no longer valid in current 3.x.
When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated.
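The two settings described in the quoted documentation can be compared side by side. A minimal sketch for Python 3, with a made-up sample file and an assumed ASCII encoding:

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"a\r\nb\rc\n")

# newline=None (the default): every line ending is translated to '\n'.
with open(path, "r", newline=None, encoding="ascii") as f:
    translated = f.read()

# newline='': line endings are recognized but returned untouched.
with open(path, "r", newline="", encoding="ascii") as f:
    untranslated = f.read()

os.remove(path)
print(repr(translated))
print(repr(untranslated))
```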
The reason you need to specify an explicit encoding for f3 is that opening a file in binary mode means the default changes from "decode with locale.getpreferredencoding(False)" to "don't decode, and return raw bytes instead of str". Again, from the docs:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)
However:
'encoding' ... should only be used in text mode.
And, at least as of 3.3, this is enforced; if you try it with binary mode, you get ValueError: binary mode doesn't take an encoding argument.
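The rejection is easy to reproduce. A small sketch for Python 3; the temporary file exists only so that the path is valid, and the variable name message is an illustrative choice.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

try:
    # Combining binary mode with an encoding is rejected by open().
    open(path, "rb", encoding="utf-8")
except ValueError as exc:
    message = str(exc)
else:
    message = None
finally:
    os.remove(path)

print(message)
```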
So, if you want to write code that works on both 2.x and 3.x, what do you use? If you want to deal in bytes, obviously f and f1 are the same. But if you want to deal in str, as appropriate for each version, the simplest answer is to write different code for each, probably f and f2, respectively. If this comes up a lot, consider writing a wrapper function:
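import sys

if sys.version_info >= (3, 0):
    # Python 3: text mode with newline='' keeps line endings untranslated.
    def crlf_open(path, mode):
        return open(path, mode, newline='')
else:
    # Python 2: binary mode leaves the data untouched.
    def crlf_open(path, mode):
        return open(path, mode+'b')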
Another thing to watch out for in writing multi-version code is that, if you're not writing locale-aware code, locale.getpreferredencoding(False) almost always returns something reasonable in 3.x, but it will usually just return 'US-ASCII' in 2.x. Using locale.getpreferredencoding(True) is technically incorrect, but may be more likely to be what you actually want if you don't want to think about encodings. (Try calling it both ways in your 2.x and 3.x interpreters to see why, or read the docs.)
Of course if you actually know the file's encoding, that's always better than guessing anyway. 当然,如果你真的知道文件的编码,那总是比猜测更好。
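When the encoding is known, it can be passed explicitly together with newline=''. The sketch below assumes a UTF-8 file for the sake of the example and prints the locale default only for comparison; the sample content is made up.

```python
import locale
import os
import tempfile

# What open() would fall back to in text mode if no encoding were given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# A UTF-8 file with a non-ASCII character and a CRLF line ending.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write("caf\u00e9\r\n".encode("utf-8"))

# Naming the encoding removes the dependence on the locale default.
with open(path, "r", newline="", encoding="utf-8") as f:
    text = f.read()

os.remove(path)
print(repr(text))
```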
In either case, the 'r' means "read-only". If you don't specify a mode, the default is 'r', so the binary-mode equivalent to the default is 'rb'.
You need to open the file in binary mode:
f = open('data.txt', 'rb')
data = f.read()
('r' for "read", 'b' for "binary")
Then everything is returned as is; nothing is normalized.
You can use the codecs module to write 'version-agnostic' code:
Underlying encoded files are always opened in binary mode. No automatic conversion of '\n' is done on reading and writing. The mode argument may be any binary mode acceptable to the built-in open() function; the 'b' is automatically added.
import codecs
with codecs.open('foo', mode='r', encoding='utf8') as f:
    # python2: u'foo\r\n'
    # python3: 'foo\r\n'
    f.readline()
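A self-contained version of the same idea, run under Python 3. The temporary file and its contents are assumptions made so the snippet can run anywhere; recent Python 3 releases may emit a DeprecationWarning for codecs.open and point to the built-in open() instead.

```python
import codecs
import os
import tempfile

# Create a small CRLF file so the example does not rely on an existing 'foo'.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"foo\r\nbar\r\n")

# codecs.open always opens the underlying file in binary mode,
# so the decoded line keeps its original '\r\n' ending.
with codecs.open(path, mode="r", encoding="utf8") as f:
    first = f.readline()

os.remove(path)
print(repr(first))
```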
Just request "read binary" in the open:
f = open('data.txt', 'rb')
data = f.read()