
What is the most efficient way to get first and last line of a text file?

I have a text file which contains a time stamp on each line. My goal is to find the time range. All the times are in order, so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in Python?

Note: These files are relatively large, about 1-2 million lines each, and I have to do this for several hundred files.

To read both the first and final line of a file you could...

  • open the file, ...
  • ... read the first line using built-in readline(), ...
  • ... seek (move the cursor) to the end of the file, ...
  • ... step backwards until you encounter EOL (line break) and ...
  • ... read the last line from there.
def readlastline(f):
    f.seek(-2, 2)              # Jump to the second last byte.
    while f.read(1) != b"\n":  # Until EOL is found ...
        f.seek(-2, 1)          # ... jump back, over the read byte plus one more.
    return f.read()            # Read all data from this point on.
    
with open(file, "rb") as f:
    first = f.readline()
    last = readlastline(f)

Jump to the second last byte directly to prevent trailing newline characters from causing empty lines to be returned*.

The current offset is pushed ahead by one every time a byte is read, so the stepping backwards is done two bytes at a time, past the recently read byte and the byte to read next.

The whence parameter passed to f.seek(offset, whence=0) indicates that seek should move to a position offset bytes relative to...
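As a quick illustration, Python accepts three whence values: 0, 1 and 2, also available as os.SEEK_SET, os.SEEK_CUR and os.SEEK_END. The following is a minimal sketch on an in-memory buffer; the sample data and variable names are made up for the example.

```python
import os
from io import BytesIO

buf = BytesIO(b"first\nsecond\nlast\n")   # 18 bytes of sample data

pos_set = buf.seek(6, os.SEEK_SET)    # 6 bytes from the beginning of the file
after_set = buf.read(6)               # reading advances the offset by 6

pos_cur = buf.seek(-6, os.SEEK_CUR)   # 6 bytes back from the current position
pos_end = buf.seek(-5, os.SEEK_END)   # 5 bytes before the end of the file
after_end = buf.read()                # everything from there to the end
```

Negative offsets only make sense with whence 1 or 2, which is why readlastline() uses f.seek(-2, 2) and f.seek(-2, 1).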

* As would be expected, since the default behavior of most applications, including print and echo, is to append one to every line written; this has no effect on lines missing a trailing newline character.
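To see why the scan starts at the second last byte, here is a small sketch that runs the same backward scan from two different starting points. The helper last_line_from and its start parameter are hypothetical, introduced only for this comparison.

```python
from io import BytesIO

def last_line_from(f, start):
    """Same backward scan as readlastline(), starting `start` bytes before the end."""
    f.seek(-start, 2)
    while f.read(1) != b"\n":
        f.seek(-2, 1)
    return f.read()

data = b"first\nlast\n"
from_last_byte = last_line_from(BytesIO(data), 1)         # stops on the trailing newline
from_second_last_byte = last_line_from(BytesIO(data), 2)  # skips the trailing newline
```

Starting at the very last byte, the scan hits the trailing newline immediately and returns an empty bytes object; starting one byte earlier, it returns the actual last line.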


Efficiency

1-2 million lines each and I have to do this for several hundred files.

I timed this method and compared it against the top answer.

10k iterations processing a file of 6k lines totalling 200kB: 1.62s vs 6.92s.
100 iterations processing a file of 6k lines totalling 1.3GB: 8.93s vs 86.95s.

Millions of lines would increase the difference a lot more.
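For anyone who wants to repeat the comparison on their own data, a rough harness along these lines may help. It is only a sketch: the function names, the generated sample file and the iteration count are assumptions, not the setup used for the numbers above.

```python
import os
import tempfile
import timeit

def first_last_seek(path):
    """Read the first line, then scan backwards from the end for the last one."""
    with open(path, "rb") as f:
        first = f.readline()
        f.seek(-2, 2)
        while f.read(1) != b"\n":
            f.seek(-2, 1)
        return first, f.read()

def first_last_loop(path):
    """Read the first line, then iterate over every remaining line."""
    with open(path, "rb") as f:
        first = last = f.readline()
        for last in f:
            pass
        return first, last

# Hypothetical sample: 6,000 short timestamp-like lines in a temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    for i in range(6000):
        tmp.write(b"2016-05-31 12:00:00 line %d\n" % i)
    sample_path = tmp.name

results_match = first_last_seek(sample_path) == first_last_loop(sample_path)
seek_seconds = timeit.timeit(lambda: first_last_seek(sample_path), number=100)
loop_seconds = timeit.timeit(lambda: first_last_loop(sample_path), number=100)
print(f"seek: {seek_seconds:.3f}s  loop: {loop_seconds:.3f}s")
os.remove(sample_path)
```

Absolute timings depend on the disk, the OS cache and the line lengths, so treat the output as a relative comparison only.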

Exact code used for timing:

with open(file, "rb") as f:
    first = f.readline()     # Read and store the first line.
    for last in f: pass      # Read all lines, keep final value.

Amendment

A more complex, and harder to read, variation to address comments and issues raised since.

  • Return empty string when parsing empty file, raised by comment.
  • Return all content when no delimiter is found, raised by comment.
  • Avoid relative offsets to support text mode, raised by comment.
  • UTF16/UTF32 hack, noted by comment.

Also adds support for multibyte delimiters, readlast(b'X<br>Y', b'<br>', fixed=False).

Please note that this variation is really slow for large files because of the non-relative offsets needed in text mode. Modify to your need, or do not use it at all, as you're probably better off using f.readlines()[-1] with files opened in text mode.

#!/bin/python3

from os import SEEK_END

def readlast(f, sep, fixed=True):
    r"""Read the last segment from a file-like object.

    :param f: File to read last line from.
    :type  f: file-like object
    :param sep: Segment separator (delimiter).
    :type  sep: bytes, str
    :param fixed: Treat data in ``f`` as a chain of fixed size blocks.
    :type  fixed: bool
    :returns: Last line of file.
    :rtype: bytes, str
    """
    bs   = len(sep)
    step = bs if fixed else 1
    if not bs:
        raise ValueError("Zero-length separator.")
    try:
        o = f.seek(0, SEEK_END)
        o = f.seek(o-bs-step)    # - Ignore trailing delimiter 'sep'.
        while f.read(bs) != sep: # - Until reaching 'sep': Read sep-sized block
            o = f.seek(o-step)   #  and then seek to the block to read next.
    except (OSError,ValueError): # - Beginning of file reached.
        f.seek(0)
    return f.read()

def test_readlast():
    from io import BytesIO, StringIO
    
    # Text mode.
    f = StringIO("first\nlast\n")
    assert readlast(f, "\n") == "last\n"
    
    # Bytes.
    f = BytesIO(b'first|last')
    assert readlast(f, b'|') == b'last'
    
    # Bytes, UTF-8.
    f = BytesIO("X\nY\n".encode("utf-8"))
    assert readlast(f, b'\n').decode() == "Y\n"
    
    # Bytes, UTF-16.
    f = BytesIO("X\nY\n".encode("utf-16"))
    assert readlast(f, b'\n\x00').decode('utf-16') == "Y\n"
  
    # Bytes, UTF-32.
    f = BytesIO("X\nY\n".encode("utf-32"))
    assert readlast(f, b'\n\x00\x00\x00').decode('utf-32') == "Y\n"
    
    # Multichar delimiter.
    f = StringIO("X<br>Y")
    assert readlast(f, "<br>", fixed=False) == "Y"
    
    # Make sure you use the correct delimiters.
    seps = { 'utf8': b'\n', 'utf16': b'\n\x00', 'utf32': b'\n\x00\x00\x00' }
    assert "\n".encode('utf8' )     == seps['utf8']
    assert "\n".encode('utf16')[2:] == seps['utf16']
    assert "\n".encode('utf32')[4:] == seps['utf32']
    
    # Edge cases.
    edges = (
        # Text , Match
        (""    , ""  ), # Empty file, empty string.
        ("X"   , "X" ), # No delimiter, full content.
        ("\n"  , "\n"),
        ("\n\n", "\n"),
        # UTF16/32 encoded U+270A (b"\n\x00\n'\n\x00"/utf16)
        (b'\n\xe2\x9c\x8a\n'.decode(), b'\xe2\x9c\x8a\n'.decode()),
    )
    for txt, match in edges:
        for enc,sep in seps.items():
            assert readlast(BytesIO(txt.encode(enc)), sep).decode(enc) == match

if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path) as f:
            print(f.readline()    , end="")
            print(readlast(f,"\n"), end="")

docs for io module

with open(fname, 'rb') as fh:
    first = next(fh).decode()

    fh.seek(-1024, 2)
    last = fh.readlines()[-1].decode()

The variable value here is 1024: it represents the average string length. I chose 1024 only as an example. If you have an estimate of the average line length, you could just use that value times 2.

Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:

for line in fh:
    pass
last = line

You don't need to bother with the binary flag; you could just use open(fname).

ETA: Since you have many files to work on, you could create a sample of a couple of dozen files using random.sample and run this code on them to determine the length of the last line, with an a priori large value of the position shift (let's say 1 MB). This will help you to estimate the value for the full run.
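The sampling idea could look roughly like this. It is a sketch under stated assumptions: last_line_length and estimate_shift are hypothetical helper names, the default sample size of 24 stands in for "a couple of dozen", and the temporary files at the end exist only to exercise the code.

```python
import os
import random
import tempfile

def last_line_length(path, shift=1024 * 1024):
    """Length in bytes of the last line, reading at most `shift` bytes from the end."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        fh.seek(max(size - shift, 0))      # clamp so small files do not raise
        lines = fh.readlines()
    return len(lines[-1]) if lines else 0

def estimate_shift(paths, sample_size=24, shift=1024 * 1024):
    """Sample some files and return twice the longest last line found."""
    sample = random.sample(paths, min(sample_size, len(paths)))
    longest = max((last_line_length(p, shift) for p in sample), default=0)
    return 2 * longest

# Made-up demo files: last lines of 11 to 15 bytes including the newline.
with tempfile.TemporaryDirectory() as d:
    paths = []
    for i in range(5):
        p = os.path.join(d, f"log{i}.txt")
        with open(p, "wb") as out:
            out.write(b"t0\n" + b"x" * (10 + i) + b"\n")
        paths.append(p)
    estimate = estimate_shift(paths)
```

The returned value can then be used as the offset in place of 1024 for the full run.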

Here's a modified version of SilentGhost's answer that will do what you want.

with open(fname, 'rb') as fh:
    first = next(fh)
    offs = -100
    while True:
        fh.seek(offs, 2)
        lines = fh.readlines()
        if len(lines)>1:
            last = lines[-1]
            break
        offs *= 2
    print(first)
    print(last)

No need for an upper bound for line length here.
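One caveat with a growing negative offset is that seeking before the start of the file raises an error, which can happen with short or single-line files. A possible variation that clamps the seek position is sketched below; the function name first_and_last and the start parameter are assumptions made for this example.

```python
import os
import tempfile

def first_and_last(fname, start=100):
    """Return (first, last) lines as bytes, doubling the tail window as needed."""
    with open(fname, "rb") as fh:
        first = fh.readline()
        size = os.fstat(fh.fileno()).st_size
        window = start
        while True:
            fh.seek(max(size - window, 0))   # never seek before the start of the file
            lines = fh.readlines()
            # More than one line in the window means lines[-1] is complete;
            # a window covering the whole file is complete as well.
            if len(lines) > 1 or window >= size:
                last = lines[-1] if lines else first
                return first, last
            window *= 2

# Made-up demo file whose last line is longer than the initial window.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"first\nmiddle\n" + b"x" * 500 + b"\n")
demo = first_and_last(tmp.name)
os.remove(tmp.name)
```

For an empty file this sketch returns two empty bytes objects, and for a single-line file it returns that line twice.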

Can you use unix commands? I think using head -1 and tail -n 1 are probably the most efficient methods. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1] to get the last, but that may take too much memory.
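If the commands are available, they can be called from Python with the subprocess module. This is a minimal sketch that assumes head and tail are on the PATH and Python 3.7 or later (for capture_output); the function name first_last_unix is made up for the example.

```python
import os
import subprocess
import tempfile

def first_last_unix(path):
    """Return (first, last) lines as bytes using the `head` and `tail` commands."""
    first = subprocess.run(["head", "-n", "1", path], check=True,
                           capture_output=True).stdout
    last = subprocess.run(["tail", "-n", "1", path], check=True,
                          capture_output=True).stdout
    return first, last

# Made-up demo file.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"a\nb\nc\n")
unix_result = first_last_unix(tmp.name)
os.remove(tmp.name)
```

Spawning two processes per file adds overhead, so with several hundred files it is worth timing this against a pure-Python approach.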

This is my solution, also compatible with Python 3. It also handles border cases, but it lacks UTF-16 support:

import os

def tail(filepath):
    """
    @author Marco Sulla (marcosullaroma@gmail.com)
    @date May 31, 2016
    """

    try:
        filepath.is_file
        fp = str(filepath)
    except AttributeError:
        fp = filepath

    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1

        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)

            if char == b"\n":
                start_pos -= 1
                f.seek(start_pos)

            if start_pos == 0:
                f.seek(start_pos)
            else:
                char = ""

                for pos in range(start_pos, -1, -1):
                    f.seek(pos)

                    char = f.read(1)

                    if char == b"\n":
                        break

        return f.readline()

It's inspired by Trasp's answer and AnotherParker's comment.

First open the file in read mode. Then use the readlines() method to read it line by line. All the lines are stored in a list. Now you can use list indexing to get the first and last lines of the file.

with open('file.txt', 'rb') as a:
    lines = a.readlines()
if lines:
    first_line = lines[0]
    last_line = lines[-1]

w = open('file.txt', 'r')
first = w.readline()
print('first line is : ', first)
x = first  # Fall back to the first line if the file has only one line.
for line in w:
    x = line
print('last line is : ', x)
w.close()

The for loop runs through the lines and x gets the last line on the final iteration.

with open("myfile.txt") as f:
    lines = f.readlines()
    first_row = lines[0]
    print(first_row)
    last_row = lines[-1]
    print(last_row)

Here is an extension of @Trasp's answer that has additional logic for handling the corner case of a file that has only one line. It may be useful to handle this case if you repeatedly want to read the last line of a file that is continuously being updated. Without this, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.

def tail(filepath):
    with open(filepath, "rb") as f:
        first = f.readline()      # Read the first line.
        f.seek(-2, 2)             # Jump to the second last byte.
        while f.read(1) != b"\n": # Until EOL is found...
            try:
                f.seek(-2, 1)     # ...jump back the read byte plus one more.
            except IOError:
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
        last = f.readline()       # Read last line.
    return last

Nobody mentioned using reversed:

with open(file, "r") as f:
    r = reversed(f.readlines())
    last_line_of_file = next(r)

Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, os.lseek some amount from SEEK_END, find the second-to-last line ending, and then readline() the last line.
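A sketch of that idea with the low-level os functions might look like the following. The function name last_line_lseek and the default bound of 4096 bytes are assumptions; if a line is longer than the bound, only its tail is returned.

```python
import os
import tempfile

def last_line_lseek(path, max_line_len=4096):
    """Return the last line as bytes, assuming no line exceeds max_line_len bytes."""
    flags = os.O_RDONLY | getattr(os, "O_BINARY", 0)   # O_BINARY only exists on Windows
    fd = os.open(path, flags)
    try:
        size = os.fstat(fd).st_size
        # Enough for the last line plus the line ending that precedes it.
        n = min(size, max_line_len + 1)
        os.lseek(fd, -n, os.SEEK_END)
        pieces = []
        while True:                      # os.read may return fewer bytes than asked
            piece = os.read(fd, 65536)
            if not piece:
                break
            pieces.append(piece)
    finally:
        os.close(fd)
    chunk = b"".join(pieces)
    body = chunk[:-1] if chunk.endswith(b"\n") else chunk   # ignore a trailing newline
    cut = body.rfind(b"\n")              # second-to-last line ending, or -1 if none
    return chunk[cut + 1:]

# Made-up demo file.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"a\nbb\nccc\n")
lseek_whole = last_line_lseek(tmp.name)
lseek_tight = last_line_lseek(tmp.name, max_line_len=4)
os.remove(tmp.name)
```

Instead of calling readline() after the seek, this version reads the tail in one go and slices it, which avoids mixing file descriptors with file objects.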

def last_line(filename):
    with open(filename, "rb") as f:  # Binary mode is needed for the seek from the end to work.
        first = f.readline()
        if f.read(1) == b"":  # Nothing after the first line: the file has only one line.
            return first
        f.seek(-2, 2)  # Jump to the second last byte.
        while f.read(1) != b"\n":  # Until EOL is found...
            f.seek(-2, 1)  # ...jump back the read byte plus one more.
        last = f.readline()  # Read last line.
        return last

The above is a modified version of the earlier answers which handles the case where there is only one line in the file.


