What is the most efficient way to get the first and last line of a text file?
I have a text file which contains a time stamp on each line. My goal is to find the time range. All the times are in order, so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in Python?

Note: These files are relatively long, about 1-2 million lines each, and I have to do this for several hundred files.
To read both the first and final line of a file you could read the first line with readline(), then seek to the end of the file, step backwards until a line break is found, and read the last line from there.

    def readlastline(f):
        f.seek(-2, 2)              # Jump to the second last byte.
        while f.read(1) != b"\n":  # Until EOL is found ...
            f.seek(-2, 1)          # ... jump back, over the read byte plus one more.
        return f.read()            # Read all data from this point on.

    with open(file, "rb") as f:
        first = f.readline()
        last = readlastline(f)
Jump to the second last byte directly to prevent a trailing newline character from causing an empty line to be returned*.

The current offset is pushed ahead by one every time a byte is read, so the stepping backwards is done two bytes at a time: past the byte just read and the byte to read next.
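To make the two-byte stepping concrete, here is a small sketch that records which offsets the backwards scan probes. It uses an in-memory io.BytesIO in place of a real file, and the helper name scan_offsets is made up for this illustration.

```python
import io

def scan_offsets(f):
    """Return (offsets probed, last line) for the backwards scan."""
    probed = []
    f.seek(-2, 2)                 # Start at the second last byte.
    while True:
        probed.append(f.tell())   # Offset of the byte about to be read.
        if f.read(1) == b"\n":    # read(1) advances the offset by one ...
            break
        f.seek(-2, 1)             # ... so step back two to probe the previous byte.
    return probed, f.read()

data = io.BytesIO(b"first\nlast\n")
probed, last = scan_offsets(data)
print(probed)   # [9, 8, 7, 6, 5]
print(last)     # b'last\n'
```

Each probe lands one byte earlier than the previous one, even though every seek moves back by two.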
The whence parameter passed to seek(offset, whence=0) indicates that seek should move to a position offset bytes relative to...

0 or os.SEEK_SET = The beginning of the file.
1 or os.SEEK_CUR = The current position.
2 or os.SEEK_END = The end of the file.

* This is what you would expect, since the default behavior of most applications, including print and echo, is to append a newline to every line written. The jump has no effect on lines that lack a trailing newline character.
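The three whence values can be tried out on an in-memory buffer. This is only an illustrative sketch; the ten-byte buffer is an arbitrary example, not data from the question.

```python
import io
import os

f = io.BytesIO(b"0123456789")

f.seek(3, os.SEEK_SET)      # 3 bytes from the beginning.
from_start = f.read(1)

f.seek(2, os.SEEK_CUR)      # 2 bytes past the current position (which is now 4).
from_current = f.read(1)

f.seek(-1, os.SEEK_END)     # 1 byte before the end.
from_end = f.read(1)

print(from_start, from_current, from_end)  # b'3' b'6' b'9'
```

In Python 3, files opened in text mode reject non-zero offsets relative to the current position or the end, which is why the answer opens the file with "rb".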
"1-2 million lines each and I have to do this for several hundred files."
I timed this method and compared it against the top answer.

10k iterations processing a file of 6k lines totalling 200kB: 1.62s vs 6.92s.
100 iterations processing a file of 6k lines totalling 1.3GB: 8.93s vs 86.95s.

Millions of lines would increase the difference a lot more.

Exact code used for timing:

    with open(file, "rb") as f:
        first = f.readline()     # Read and store the first line.
        for last in f: pass      # Read all lines, keep final value.
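A comparison along these lines could be reproduced with timeit. The sketch below is not the author's benchmark: the function names, the temporary file, and its 6000 timestamp-like lines are all assumptions made for illustration, and the numbers it prints will differ from those quoted above.

```python
import os
import tempfile
import timeit

def last_by_seek(path):
    """Seek backwards from the end until a line break is found."""
    with open(path, "rb") as f:
        f.seek(-2, 2)
        while f.read(1) != b"\n":
            f.seek(-2, 1)
        return f.read()

def last_by_loop(path):
    """Iterate over every line and keep the final one."""
    with open(path, "rb") as f:
        for last in f:
            pass
        return last

# Hypothetical test file: 6000 short timestamp-like lines.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    for i in range(6000):
        tmp.write(b"2016-05-31 12:00:%05d\n" % i)
    path = tmp.name

try:
    assert last_by_seek(path) == last_by_loop(path)
    t_seek = timeit.timeit(lambda: last_by_seek(path), number=100)
    t_loop = timeit.timeit(lambda: last_by_loop(path), number=100)
    print("seek: %.4fs  loop: %.4fs" % (t_seek, t_loop))
finally:
    os.remove(path)
```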
A more complex, and harder to read, variation that addresses comments and issues raised since. It also adds support for multibyte delimiters, e.g. readlast(b'X<br>Y', b'<br>', fixed=False).

Please note that this variation is really slow for large files because of the non-relative offsets needed in text mode. Modify it to your needs, or do not use it at all, as you're probably better off using f.readlines()[-1] with files opened in text mode.
    #!/bin/python3

    from os import SEEK_END

    def readlast(f, sep, fixed=True):
        r"""Read the last segment from a file-like object.

        :param f: File to read last line from.
        :type  f: file-like object
        :param sep: Segment separator (delimiter).
        :type  sep: bytes, str
        :param fixed: Treat data in ``f`` as a chain of fixed size blocks.
        :type  fixed: bool
        :returns: Last line of file.
        :rtype: bytes, str
        """
        bs   = len(sep)
        step = bs if fixed else 1
        if not bs:
            raise ValueError("Zero-length separator.")
        try:
            o = f.seek(0, SEEK_END)
            o = f.seek(o-bs-step)     # - Ignore trailing delimiter 'sep'.
            while f.read(bs) != sep:  # - Until reaching 'sep': Read sep-sized block
                o = f.seek(o-step)    #   and then seek to the block to read next.
        except (OSError,ValueError):  # - Beginning of file reached.
            f.seek(0)
        return f.read()

    def test_readlast():
        from io import BytesIO, StringIO

        # Text mode.
        f = StringIO("first\nlast\n")
        assert readlast(f, "\n") == "last\n"

        # Bytes.
        f = BytesIO(b'first|last')
        assert readlast(f, b'|') == b'last'

        # Bytes, UTF-8.
        f = BytesIO("X\nY\n".encode("utf-8"))
        assert readlast(f, b'\n').decode() == "Y\n"

        # Bytes, UTF-16.
        f = BytesIO("X\nY\n".encode("utf-16"))
        assert readlast(f, b'\n\x00').decode('utf-16') == "Y\n"

        # Bytes, UTF-32.
        f = BytesIO("X\nY\n".encode("utf-32"))
        assert readlast(f, b'\n\x00\x00\x00').decode('utf-32') == "Y\n"

        # Multichar delimiter.
        f = StringIO("X<br>Y")
        assert readlast(f, "<br>", fixed=False) == "Y"

        # Make sure you use the correct delimiters.
        seps = { 'utf8': b'\n', 'utf16': b'\n\x00', 'utf32': b'\n\x00\x00\x00' }
        assert "\n".encode('utf8' )     == seps['utf8']
        assert "\n".encode('utf16')[2:] == seps['utf16']
        assert "\n".encode('utf32')[4:] == seps['utf32']

        # Edge cases.
        edges = (
            # Text , Match
            (""    , ""  ),  # Empty file, empty string.
            ("X"   , "X" ),  # No delimiter, full content.
            ("\n"  , "\n"),
            ("\n\n", "\n"),
            # UTF16/32 encoded U+270A (b"\n\x00\n'\n\x00"/utf16)
            (b'\n\xe2\x9c\x8a\n'.decode(), b'\xe2\x9c\x8a\n'.decode()),
        )
        for txt, match in edges:
            for enc,sep in seps.items():
                assert readlast(BytesIO(txt.encode(enc)), sep).decode(enc) == match

    if __name__ == "__main__":
        import sys
        for path in sys.argv[1:]:
            with open(path) as f:
                print(f.readline()    , end="")
                print(readlast(f,"\n"), end="")

    with open(fname, 'rb') as fh:
        first = next(fh).decode()

        fh.seek(-1024, 2)
        last = fh.readlines()[-1].decode()

The value to adjust here is 1024: it represents the average line length. I chose 1024 only as an example. If you have an estimate of the average line length, you could just use that value times 2.

Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:

    for line in fh:
        pass
    last = line

You don't need to bother with the binary flag; you could just use open(fname).

ETA: Since you have many files to work on, you could create a sample of a couple of dozen files using random.sample and run this code on them to determine the length of the last line, with an a priori large value for the position shift (let's say 1 MB). This will help you estimate the value for the full run.
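The sampling idea could look roughly like the following. This is a sketch under assumptions: the function names last_line_length and estimate_shift, the default sample size of 24 files, and the seed parameter are invented here; only random.sample and the 1 MB shift come from the text above.

```python
import os
import random

def last_line_length(path, shift=1024 * 1024):
    """Length in bytes of the last line, looking at most `shift` bytes back."""
    with open(path, "rb") as fh:
        size = os.path.getsize(path)
        fh.seek(max(size - shift, 0))    # Clamp so small files do not raise.
        lines = fh.readlines()
        return len(lines[-1]) if lines else 0

def estimate_shift(paths, k=24, seed=None):
    """Longest last line found in a random sample of up to k files."""
    rng = random.Random(seed)
    sample = rng.sample(paths, min(k, len(paths)))
    return max((last_line_length(p) for p in sample), default=0)
```

The result can then be used, with some safety margin, as the offset passed to fh.seek in the full run. Note that paths has to be a list (or another sequence) for random.sample to accept it.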
Here's a modified version of SilentGhost's answer that will do what you want.

    with open(fname, 'rb') as fh:
        first = next(fh)
        offs = -100
        while True:
            fh.seek(offs, 2)
            lines = fh.readlines()
            if len(lines) > 1:
                last = lines[-1]
                break
            offs *= 2
        print(first)
        print(last)

No need for an upper bound for line length here.
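One thing to watch with the doubling approach: seeking further back than the start of the file raises an error, which happens for files that are shorter than the offset or contain a single line. A possible guard, sketched here with a made-up function name and a start parameter that are not part of the original answer, is to clamp the window to the file size:

```python
import os

def first_and_last(fname, start=100):
    """Return (first, last) lines as bytes, doubling the tail window as needed."""
    with open(fname, "rb") as fh:
        first = fh.readline()
        size = os.fstat(fh.fileno()).st_size
        window = max(start, 1)               # A zero window could never grow.
        while True:
            window = min(window, size)       # Never seek before the start of the file.
            fh.seek(size - window)
            lines = fh.readlines()
            # More than one line in the window means the last one is complete;
            # so does a window that already covers the whole file.
            if len(lines) > 1 or window == size:
                last = lines[-1] if lines else first
                return first, last
            window *= 2
```

For an empty file this returns two empty byte strings, and for a one-line file it returns the same line twice.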
Can you use unix commands? I think using head -1 and tail -n 1 are probably the most efficient methods. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1] to get the last, but that may take too much memory.
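If memory is the concern with readlines(), one alternative (not from the answer above, just a sketch) is to stream the file through a collections.deque with maxlen=1, which keeps only the most recent line. The function name is made up for this example. It still reads the whole file, so it is no faster than a plain loop, only lighter on memory.

```python
from collections import deque

def first_and_last_streaming(path):
    """Return (first, last) lines as bytes without building a list of all lines."""
    with open(path, "rb") as fid:
        first = fid.readline()
        # A deque with maxlen=1 discards every line except the most recent one.
        tail = deque(fid, maxlen=1)
    last = tail[0] if tail else first
    return first, last
```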
This is my solution, also compatible with Python 3. It also handles border cases, but it lacks UTF-16 support:

    import os

    def tail(filepath):
        """
        @author Marco Sulla (marcosullaroma@gmail.com)
        @date May 31, 2016
        """

        try:
            filepath.is_file          # Accept pathlib.Path-like objects.
            fp = str(filepath)
        except AttributeError:
            fp = filepath

        with open(fp, "rb") as f:
            size = os.stat(fp).st_size
            start_pos = 0 if size - 1 < 0 else size - 1

            if start_pos != 0:
                f.seek(start_pos)
                char = f.read(1)

                if char == b"\n":     # Skip a trailing newline.
                    start_pos -= 1
                    f.seek(start_pos)

                if start_pos == 0:
                    f.seek(start_pos)
                else:
                    char = ""

                    # Walk backwards byte by byte until a line break is found.
                    for pos in range(start_pos, -1, -1):
                        f.seek(pos)

                        char = f.read(1)

                        if char == b"\n":
                            break

            return f.readline()

It's inspired by Trasp's answer and AnotherParker's comment.
First open the file in read mode. Then use the readlines() method to read it line by line; all the lines are stored in a list. Now you can use list indexing to get the first and last lines of the file.

    with open('file.txt', 'rb') as a:
        lines = a.readlines()
    if lines:
        first_line = lines[0]    # lines[:1] would give a one-element list instead.
        last_line = lines[-1]

    w = open('file.txt', 'r')
    print('first line is : ', w.readline())
    for line in w:
        x = line
    print('last line is : ', x)
    w.close()

The for loop runs through the lines, and x holds the last line after the final iteration.

    with open("myfile.txt") as f:
        lines = f.readlines()
        first_row = lines[0]
        print(first_row)
        last_row = lines[-1]
        print(last_row)

Here is an extension of @Trasp's answer that has additional logic for handling the corner case of a file that has only one line. It may be useful to handle this case if you repeatedly want to read the last line of a file that is continuously being updated. Without this, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.

    def tail(filepath):
        with open(filepath, "rb") as f:
            first = f.readline()       # Read the first line.
            f.seek(-2, 2)              # Jump to the second last byte.
            while f.read(1) != b"\n":  # Until EOL is found...
                try:
                    f.seek(-2, 1)      # ...jump back the read byte plus one more.
                except IOError:
                    f.seek(-1, 1)
                    if f.tell() == 0:
                        break
            last = f.readline()        # Read last line.
        return last

Nobody mentioned using reversed:

    f = open(file, "r")
    r = reversed(f.readlines())
    last_line_of_file = next(r)

Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, os.lseek some amount from SEEK_END, find the second-to-last line ending, and then readline() the last line.
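A sketch of that idea using the low-level os functions follows. The function name last_line_lseek and the max_line_len default of 1024 bytes are assumptions for illustration; the approach only gives the right answer if no line, including its newline, is longer than max_line_len.

```python
import os

def last_line_lseek(path, max_line_len=1024):
    """Return the last line as bytes, assuming no line exceeds max_line_len bytes."""
    # O_BINARY only exists on Windows; fall back to 0 elsewhere.
    fd = os.open(path, os.O_RDONLY | getattr(os, "O_BINARY", 0))
    try:
        size = os.fstat(fd).st_size
        # Read enough for the last line plus the line ending in front of it.
        window = min(size, max_line_len + 1)
        os.lseek(fd, -window, os.SEEK_END)
        chunk = os.read(fd, window)
    finally:
        os.close(fd)
    # Ignore a trailing newline, then cut after the second-to-last line ending.
    cut = chunk.rfind(b"\n", 0, len(chunk) - 1)
    return chunk[cut + 1:]
```

Instead of calling readline(), this version slices the chunk it has already read, which avoids a second system call.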

    def read_last_line(filename):  # Illustrative wrapper name; the return statements need a function.
        with open(filename, "rb") as f:  # Needs to be in binary mode for the seek from the end to work.
            first = f.readline()
            if f.read(1) == b'':       # Nothing follows the first line: single-line file.
                return first
            f.seek(-2, 2)              # Jump to the second last byte.
            while f.read(1) != b"\n":  # Until EOL is found...
                f.seek(-2, 1)          # ...jump back the read byte plus one more.
            last = f.readline()        # Read last line.
        return last

This is a modified version of the answers above that handles the case where the file contains only one line.