How to split only on carriage returns with readlines in Python?

I have a text file that contains both \n and \r\n end-of-line markers. I want to split only on \r\n, but can't figure out a way to do this with Python's readlines method. Is there a simple workaround for this?

As @eskaev mentions, you'll usually want to avoid reading the complete file into memory if it isn't necessary.

io.open() allows you to specify a newline keyword argument (in Python 3, the built-in open() is the same function), so you can still iterate over lines and have them split only at the specified newline sequence:

import io

with io.open('in.txt', newline='\r\n') as f:
    for line in f:
        print(repr(line))  # parenthesized form works in Python 2 and 3

Output (Python 2; Python 3 prints the same strings without the u prefix):

u'this\nis\nsome\r\n'
u'text\nwith\nnewlines.'
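Since the question asks about readlines specifically, the same newline argument can be combined with it. The following is a minimal Python 3 sketch, not part of the original answer; the helper name read_crlf_lines and the UTF-8 default are assumptions made for illustration.

```python
def read_crlf_lines(path, encoding='utf-8'):
    # newline='\r\n' tells the text layer to end lines only at \r\n;
    # lone \n characters stay inside the returned strings, untranslated.
    with open(path, newline='\r\n', encoding=encoding) as f:
        return f.readlines()
```

Each returned string keeps its trailing \r\n, just as readlines normally keeps \n, so strip it with line.rstrip('\r\n') if it is not wanted. Note that readlines still loads every line into memory at once.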

Avoid reading it in text mode. By default, Python reads text files with universal newline support. This means that all line endings are translated to \n:

>>> with open('out', 'wb') as f:
...     f.write(b'a\nb\r\nc\r\nd\ne\r\nf')
... 
14
>>> with open('out', 'r') as f: f.readlines()
... 
['a\n', 'b\n', 'c\n', 'd\n', 'e\n', 'f']

Note that using U doesn't change the result (see note 1; the U flag is deprecated and was removed in Python 3.11):

>>> with open('out', 'rU') as f: f.readlines()
... 
['a\n', 'b\n', 'c\n', 'd\n', 'e\n', 'f']

However, you can always read the file in binary mode, decode it, and then split on \r\n:

>>> with open('out', 'rb') as f: f.read().split(b'\r\n')
... 
[b'a\nb', b'c', b'd\ne', b'f']

(Example in Python 3. You can decode the bytes into unicode either before or after the split.)
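To make the "before or after" point concrete, here is a small sketch of both orders. It assumes the file is UTF-8 encoded, which the original does not state; the variable names are only illustrative.

```python
raw = b'a\nb\r\nc\r\nd\ne\r\nf'  # same bytes as written to 'out' above

# Decode first, then split the resulting str on the str separator.
decoded_first = raw.decode('utf-8').split('\r\n')

# Split the bytes first, then decode each piece separately.
decoded_after = [piece.decode('utf-8') for piece in raw.split(b'\r\n')]
```

For UTF-8 the two orders give the same list, because the bytes for \r and \n never occur inside a multi-byte character. With an encoding such as UTF-16 that does not hold, so decoding before splitting is the safer order there.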

You can avoid reading the whole file into memory and read it in blocks instead. However, it becomes a bit more complex to handle the lines correctly (you have to manually check where the last line started and concatenate it to the following block).
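A rough sketch of that block-wise approach might look like the following. It is not from the original answer: the function name iter_crlf_lines and the default block size are assumptions, and it works on bytes, so decode each piece afterwards if you need text.

```python
def iter_crlf_lines(path, chunk_size=64 * 1024):
    # Yield byte strings split only on b'\r\n', reading the file in blocks.
    sep = b'\r\n'
    buffer = b''
    with open(path, 'rb') as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            buffer += block
            pieces = buffer.split(sep)
            # The last piece may be an unfinished line (or end with a lone
            # \r whose \n is in the next block), so carry it over.
            buffer = pieces.pop()
            for piece in pieces:
                yield piece
    # Whatever is left after the final block is the last line.
    yield buffer
```

The pieces it yields match what f.read().split(b'\r\n') would return, without holding the whole file in memory. If the file contains a very long stretch with no \r\n, the buffer grows until one is found, so this is only a sketch for typical line lengths.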


Note 1: I believe this is because universal newline support is enabled by default in all normal installations. You have to explicitly disable it when configuring the installation, and then the r and rU modes would behave differently (the first would only split lines on the OS-native line endings, while the latter would produce the result shown above).

Instead of using readlines, just use read and then split.

For example:

with open('/path/to/file', 'r', newline='') as f:  # newline='' disables newline translation (Python 3)
    fileContents = f.read()  # read entire file
    filePieces = fileContents.split('\r\n')

This approach reads the file and iterates over the chunks split by your separator. Note that read() loads the whole file into memory and split() returns a list, not a generator.

with open(myFile, newline='') as ifs:  # newline='' keeps \r\n untranslated (Python 3)
    for chunk in ifs.read().split(mySep):
        pass  # do something with the chunk
