
Don't convert newline when reading a file

I'm reading a text file:

f = open('data.txt')
data = f.read()

However, the newlines in the data variable are normalized to LF ('\n'), while the file contains CRLF ('\r\n').

How can I instruct Python to read the file as is?

In Python 2.x:

f = open('data.txt', 'rb')

As the docs say:

The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability. (Appending 'b' is useful even on systems that don't treat binary and text files differently, where it serves as documentation.)

In Python 3.x, there are three alternatives:

f1 = open('data.txt', 'rb')

This will leave newlines untransformed, but will also return bytes instead of str, which you will have to explicitly decode to Unicode yourself. (Of course, the 2.x version also returned bytes that had to be decoded manually if you wanted Unicode, but in 2.x that's what a str object is; in 3.x, str is Unicode.)
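A minimal sketch of this binary-mode route on Python 3. It assumes the file is UTF-8 encoded (the question does not state an encoding) and uses a temporary file as a stand-in for data.txt:

```python
import os
import tempfile

# Create a small sample file with CRLF line endings (stand-in for data.txt).
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(b'caf\xc3\xa9\r\nsecond line\r\n')

# Binary mode returns the raw bytes, so CRLF is preserved.
with open(path, 'rb') as f:
    raw = f.read()

# Decode explicitly; 'utf-8' is an assumption about this sample file.
text = raw.decode('utf-8')

os.remove(path)
print(repr(text))
```

If the real encoding differs, the decode call is the only line that needs to change.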

f2 = open('data.txt', 'r', newline='')

This will return str, and leave newlines untranslated. Unlike the 2.x equivalent, however, readline and friends will treat '\r\n' as a newline, instead of a regular character followed by a newline. Usually this won't matter, but if it does, keep it in mind.

f3 = open('data.txt', 'rb', encoding=locale.getpreferredencoding(False))

This treats newlines exactly the same way as the 2.x code, and returns str using the same encoding you'd get if you just used all of the defaults… but it's no longer valid in current 3.x.

When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated.
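The difference between the default, newline='', and binary mode can be checked with a short Python 3 sketch. The temporary file and the 'ascii' encoding are assumptions chosen only to keep the example self-contained:

```python
import os
import tempfile

# Sample file mixing all three line-ending styles (stand-in for data.txt).
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(b'a\r\nb\rc\n')

# newline=None (the default): every line ending is translated to '\n'.
with open(path, 'r', encoding='ascii') as f:
    translated = f.read()

# newline='': line endings are recognised but returned untranslated.
with open(path, 'r', encoding='ascii', newline='') as f:
    untranslated = f.read()

# With newline='', '\r', '\n' and '\r\n' each end a line.
with open(path, 'r', encoding='ascii', newline='') as f:
    text_lines = f.readlines()

# Binary mode: only b'\n' ends a line, so a lone b'\r' stays inside it.
with open(path, 'rb') as f:
    byte_lines = f.readlines()

os.remove(path)

print(repr(translated))    # 'a\nb\nc\n'
print(repr(untranslated))  # 'a\r\nb\rc\n'
print(text_lines)          # ['a\r\n', 'b\r', 'c\n']
print(byte_lines)          # [b'a\r\n', b'b\rc\n']
```

The last two results show the readline difference mentioned for f2: text mode with newline='' splits after the lone '\r', while binary mode does not.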

The reason you need to specify an explicit encoding for f3 is that opening a file in binary mode means the default changes from "decode with locale.getpreferredencoding(False)" to "don't decode, and return raw bytes instead of str". Again, from the docs:

In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)

However:

'encoding' … should only be used in text mode.

And, at least as of 3.3, this is enforced; if you try it with binary mode, you get ValueError: binary mode doesn't take an encoding argument.
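A quick way to see this on a current Python 3 interpreter is sketched below; the temporary file is only a stand-in for data.txt, and 'utf-8' is an arbitrary encoding chosen for the demonstration:

```python
import os
import tempfile

# Empty sample file (stand-in for data.txt).
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

try:
    # Combining binary mode with an encoding is rejected by open().
    open(path, 'rb', encoding='utf-8')
except ValueError as exc:
    message = str(exc)
else:
    message = None
finally:
    os.remove(path)

print(message)
```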

So, if you want to write code that works on both 2.x and 3.x, what do you use? If you want to deal in bytes, obviously f and f1 are the same. But if you want to deal in str, as appropriate for each version, the simplest answer is to write different code for each, probably f and f2, respectively. If this comes up a lot, consider writing a wrapper function:

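import sys

if sys.version_info >= (3, 0):
    def crlf_open(path, mode):
        # Python 3: text mode with newline='' leaves line endings untranslated.
        return open(path, mode, newline='')
else:
    def crlf_open(path, mode):
        # Python 2: binary mode disables newline translation.
        return open(path, mode+'b')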

Another thing to watch out for in writing multi-version code is that, if you're not writing locale-aware code, locale.getpreferredencoding(False) almost always returns something reasonable in 3.x, but it will usually just return 'US-ASCII' in 2.x. Using locale.getpreferredencoding(True) is technically incorrect, but may be more likely to be what you actually want if you don't want to think about encodings. (Try calling it both ways in your 2.x and 3.x interpreters to see why, or read the docs.)

Of course, if you actually know the file's encoding, that's always better than guessing anyway.
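When the encoding is known, one possible sketch is to pass it explicitly through io.open, which is the same function as the built-in open on 3.x and is also available on 2.6+. The helper name read_text_keep_newlines and the 'utf-8' encoding are illustrative assumptions, not part of the answer above:

```python
import io
import os
import tempfile


def read_text_keep_newlines(path, encoding):
    """Hypothetical helper: return decoded text with line endings untouched."""
    # newline='' keeps '\r\n' as-is; the encoding is given, not guessed.
    with io.open(path, 'r', encoding=encoding, newline='') as f:
        return f.read()


# Sample file with CRLF line endings (stand-in for data.txt).
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(b'first\r\nsecond\r\n')

data = read_text_keep_newlines(path, 'utf-8')
os.remove(path)
print(repr(data))
```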

In either case, the 'r' means "read-only". If you don't specify a mode, the default is 'r', so the binary-mode equivalent to the default is 'rb'.

You need to open the file in binary mode:

f = open('data.txt', 'rb')
data = f.read()

('r' for "read", 'b' for "binary")

Then everything is returned as is; nothing is normalized.

You can use the codecs module to write 'version-agnostic' code:

Underlying encoded files are always opened in binary mode. No automatic conversion of '\n' is done on reading and writing. The mode argument may be any binary mode acceptable to the built-in open() function; the 'b' is automatically added.

import codecs
with codecs.open('foo', mode='r', encoding='utf8') as f:
    # python2: u'foo\r\n'
    # python3: 'foo\r\n'
    f.readline()

Just request "read binary" in the open:

f = open('data.txt', 'rb')
data = f.read()

Open the file using open('data.txt', 'rb'). See the doc.

