What is the most efficient way to get the first and last line of a text file?

Question

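I have a text file that contains a timestamp on each line. My goal is to find the time range. All the times are in order, so the first line is the earliest time and the last line is the latest time. I only need the very first and very last line. What is the most efficient way to get these lines in Python?

Note: These files are relatively large, about 1-2 million lines each, and I have to do this for several hundred files.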

Answer 1

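To read both the first and the final line of a file you could...

Open the file, ...
... read the first line using the built-in readline(), ...
... seek (move the cursor) to the end of the file, ...
... step backwards until you encounter EOL (line break) and ...
... read the last line from there.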

def readlastline(f):
    f.seek(-2, 2)              # Jump to the second last byte.
    while f.read(1) != b"\n":  # Until EOL is found ...
        f.seek(-2, 1)          # ... jump back, over the read byte plus one more.
    return f.read()            # Read all data from this point on.
    
with open(file, "rb") as f:
    first = f.readline()
    last = readlastline(f)

Jump to the second last byte directly, so that a trailing newline character does not cause an empty line to be returned*.

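The current offset is pushed ahead by one every time a byte is read, so the stepping backwards is done two bytes at a time: past the byte that was just read and onto the byte to read next.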
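To make the offset bookkeeping concrete, here is a small trace over an in-memory buffer. The buffer contents are an assumption chosen for illustration; the loop is the same as in readlastline above, except that it records the position before each one-byte read.

```python
from io import BytesIO

f = BytesIO(b"first\nlast\n")   # hypothetical sample data, 11 bytes

f.seek(-2, 2)                   # start on the byte before the trailing newline
visited = []
while True:
    pos = f.tell()
    byte = f.read(1)            # reading advances the position by one
    visited.append((pos, byte))
    if byte == b"\n":
        break
    f.seek(-2, 1)               # back over the byte just read plus one more

last = f.read()                 # everything after the newline that was found
print(visited)
print(last)
```

Each pass inspects the byte one position earlier than the previous pass, which is the net effect of reading one byte forward and seeking two bytes back.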

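The whence parameter passed to seek(offset, whence=0) indicates that seek should move to a position offset bytes relative to...

0 or os.SEEK_SET = The beginning of the file.
1 or os.SEEK_CUR = The current position.
2 or os.SEEK_END = The end of the file.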
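A quick way to see the three whence values in action is to seek around a small in-memory buffer. The buffer below is an assumption for illustration; a file opened in binary mode behaves the same way.

```python
import os
from io import BytesIO

f = BytesIO(b"0123456789")      # hypothetical 10-byte buffer

f.seek(3, os.SEEK_SET)          # 3 bytes from the beginning
from_start = f.read(1)          # reads the byte at index 3, position becomes 4

f.seek(2, os.SEEK_CUR)          # 2 bytes forward from the current position (4)
from_current = f.read(1)        # reads the byte at index 6

f.seek(-2, os.SEEK_END)         # 2 bytes before the end
from_end = f.read(1)            # reads the byte at index 8

print(from_start, from_current, from_end)
```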

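* This is to be expected, since the default behavior of most applications, including print and echo, is to append a newline to every line written. Skipping that byte has no effect on lines that lack a trailing newline character.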

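Efficiency

"1-2 million lines each and I have to do this for several hundred files."

I timed this method and compared it against the top answer.

10k iterations processing a file of 6k lines totalling 200kB: 1.62s vs 6.92s.
100 iterations processing a file of 6k lines totalling 1.3GB: 8.93s vs 86.95s.

Millions of lines would increase the difference a lot more.

Exact code used for timing: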

with open(file, "rb") as f:
    first = f.readline()     # Read and store the first line.
    for last in f: pass      # Read all lines, keep final value.
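If you want to reproduce the comparison on your own machine, a rough harness could look like the sketch below. The generated test file, the helper names last_by_seek and last_by_scan, and the iteration count are all assumptions made for illustration; they are not the setup behind the numbers quoted above, and absolute timings will differ.

```python
import os
import tempfile
import timeit

def last_by_seek(path):
    """Seek to the end and step backwards to the previous newline."""
    with open(path, "rb") as f:
        f.seek(-2, 2)
        while f.read(1) != b"\n":
            f.seek(-2, 1)
        return f.read()

def last_by_scan(path):
    """Iterate over every line and keep the final one."""
    with open(path, "rb") as f:
        last = b""
        for last in f:
            pass
        return last

# Hypothetical test file: 6000 timestamp-like lines.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    for i in range(6000):
        tmp.write(b"2010-07-27 18:06:%02d line %d\n" % (i % 60, i))
    path = tmp.name

try:
    # Both approaches must agree before their timings mean anything.
    assert last_by_seek(path) == last_by_scan(path)
    t_seek = timeit.timeit(lambda: last_by_seek(path), number=100)
    t_scan = timeit.timeit(lambda: last_by_scan(path), number=100)
    print(f"seek: {t_seek:.4f}s  scan: {t_scan:.4f}s")
finally:
    os.remove(path)
```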

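Amendment

A more complex, and harder to read, variation to address comments and issues raised since.

Return an empty string when parsing an empty file, raised by comment.
Return all content when no delimiter is found, raised by comment.
Avoid relative offsets to support text mode, raised by comment.
UTF-16 / UTF-32 hack, noted by comment.

Also adds support for multibyte delimiters: readlast(b'X<br>Y', b'<br>', fixed=False).

Please note that this variation is really slow for large files because of the non-relative offsets needed in text mode. Modify it to your needs, or do not use it at all, as you are probably better off using f.readlines()[-1] with files opened in text mode.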

#!/bin/python3

from os import SEEK_END

def readlast(f, sep, fixed=True):
    r"""Read the last segment from a file-like object.

    :param f: File to read last line from.
    :type  f: file-like object
    :param sep: Segment separator (delimiter).
    :type  sep: bytes, str
    :param fixed: Treat data in ``f`` as a chain of fixed size blocks.
    :type  fixed: bool
    :returns: Last line of file.
    :rtype: bytes, str
    """
    bs   = len(sep)
    step = bs if fixed else 1
    if not bs:
        raise ValueError("Zero-length separator.")
    try:
        o = f.seek(0, SEEK_END)
        o = f.seek(o-bs-step)    # - Ignore trailing delimiter 'sep'.
        while f.read(bs) != sep: # - Until reaching 'sep': Read sep-sized block
            o = f.seek(o-step)   #  and then seek to the block to read next.
    except (OSError,ValueError): # - Beginning of file reached.
        f.seek(0)
    return f.read()

def test_readlast():
    from io import BytesIO, StringIO
    
    # Text mode.
    f = StringIO("first\nlast\n")
    assert readlast(f, "\n") == "last\n"
    
    # Bytes.
    f = BytesIO(b'first|last')
    assert readlast(f, b'|') == b'last'
    
    # Bytes, UTF-8.
    f = BytesIO("X\nY\n".encode("utf-8"))
    assert readlast(f, b'\n').decode() == "Y\n"
    
    # Bytes, UTF-16.
    f = BytesIO("X\nY\n".encode("utf-16"))
    assert readlast(f, b'\n\x00').decode('utf-16') == "Y\n"
  
    # Bytes, UTF-32.
    f = BytesIO("X\nY\n".encode("utf-32"))
    assert readlast(f, b'\n\x00\x00\x00').decode('utf-32') == "Y\n"
    
    # Multichar delimiter.
    f = StringIO("X<br>Y")
    assert readlast(f, "<br>", fixed=False) == "Y"
    
    # Make sure you use the correct delimiters.
    seps = { 'utf8': b'\n', 'utf16': b'\n\x00', 'utf32': b'\n\x00\x00\x00' }
    assert "\n".encode('utf8' )     == seps['utf8']
    assert "\n".encode('utf16')[2:] == seps['utf16']
    assert "\n".encode('utf32')[4:] == seps['utf32']
    
    # Edge cases.
    edges = (
        # Text , Match
        (""    , ""  ), # Empty file, empty string.
        ("X"   , "X" ), # No delimiter, full content.
        ("\n"  , "\n"),
        ("\n\n", "\n"),
        # UTF16/32 encoded U+270A (b"\n\x00\n'\n\x00"/utf16)
        (b'\n\xe2\x9c\x8a\n'.decode(), b'\xe2\x9c\x8a\n'.decode()),
    )
    for txt, match in edges:
        for enc,sep in seps.items():
            assert readlast(BytesIO(txt.encode(enc)), sep).decode(enc) == match

if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path) as f:
            print(f.readline()    , end="")
            print(readlast(f,"\n"), end="")

Answer 2

See the documentation for the io module.

with open(fname, 'rb') as fh:
    first = next(fh).decode()

    fh.seek(-1024, 2)
    last = fh.readlines()[-1].decode()

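The variable value here is 1024: it represents the average line length. I chose 1024 only as an example. If you have an estimate of the average line length, you could use that value times 2.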
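One thing to watch out for: seeking 1024 bytes back from the end raises an error when the file is shorter than that. A minimal sketch that clamps the offset to the file size is shown here; the function name last_line_in_window and the window parameter are assumptions introduced for illustration.

```python
import os

def last_line_in_window(path, window=1024):
    """Return the last line, reading at most `window` bytes from the end.

    `window` is a hypothetical parameter: it must exceed the length of the
    last line, otherwise a truncated line is returned.
    """
    with open(path, "rb") as fh:
        size = fh.seek(0, os.SEEK_END)    # seek returns the new position
        fh.seek(max(size - window, 0))    # clamp so short files do not raise
        lines = fh.readlines()
    return lines[-1].decode() if lines else ""
```

As with the snippet above, the result is only the complete last line if the window is larger than that line.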

Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution is to loop over the file:

for line in fh:
    pass
last = line

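You don't need to bother with the binary flag; you could just use open(fname).

ETA: Since you have many files to work on, you could create a sample of a couple of dozen files using random.sample and run this code on them to determine the length of the last line, using an a priori large value for the position shift (say, 1 MB). This will help you estimate the value for the full run.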
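The sampling idea could be sketched roughly as follows. The function name estimate_window, its parameters, and the choice to report twice the longest observed last line are assumptions for illustration, not part of the answer itself.

```python
import os
import random

def estimate_window(paths, sample_size=24, probe=1024 * 1024):
    """Estimate a seek offset from the longest last line in a random sample.

    `probe` is the deliberately large offset (1 MB here) used while sampling.
    """
    sample = random.sample(paths, min(sample_size, len(paths)))
    longest = 0
    for path in sample:
        with open(path, "rb") as fh:
            size = fh.seek(0, os.SEEK_END)
            fh.seek(max(size - probe, 0))   # clamp for files shorter than probe
            lines = fh.readlines()
        if lines:
            longest = max(longest, len(lines[-1]))
    return 2 * longest                      # safety factor of 2
```

The returned value could then be used as the offset for the full run over all files.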

Answer 3

Here's a modified version of SilentGhost's answer that will do what you want.

with open(fname, 'rb') as fh:
    first = next(fh)
    offs = -100
    while True:
        fh.seek(offs, 2)
        lines = fh.readlines()
        if len(lines)>1:
            last = lines[-1]
            break
        offs *= 2
    print(first)
    print(last)

There's no need for an upper bound on the line length here.
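Note that seeking 100 bytes back from the end fails on files shorter than 100 bytes, and a file with a single line would keep doubling the offset until the seek fails. A hedged variant that clamps the offset to the file size might look like this; the name last_line_doubling and the start parameter are assumptions for illustration.

```python
import os

def last_line_doubling(path, start=100):
    """Read a growing window from the end until it holds the full last line."""
    with open(path, "rb") as fh:
        size = fh.seek(0, os.SEEK_END)
        offs = start
        while True:
            fh.seek(max(size - offs, 0))    # never seek before the start
            lines = fh.readlines()
            # Two or more lines means lines[-1] is complete; a window that
            # covers the whole file means there is nothing more to read.
            if len(lines) > 1 or offs >= size:
                return lines[-1] if lines else b""
            offs *= 2
```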

Answer 4

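Can you use unix commands? I think using head -1 and tail -n 1 are probably the most efficient methods. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1] to get the last, but that may take too much memory.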
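If you want to call those commands from Python, a sketch using the standard subprocess module could look like this. It assumes head and tail are available on PATH (so it will not work on a stock Windows install), and the function name first_and_last_via_unix is made up for illustration. Whether spawning two processes per file is actually faster than seeking in Python is something you would need to measure.

```python
import subprocess

def first_and_last_via_unix(path):
    """Return (first_line, last_line) as bytes using POSIX head and tail."""
    first = subprocess.run(["head", "-n", "1", path],
                           capture_output=True, check=True).stdout
    last = subprocess.run(["tail", "-n", "1", path],
                          capture_output=True, check=True).stdout
    return first, last
```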

Answer 5

This is my solution, which is also compatible with Python 3. It also handles border cases, but it lacks UTF-16 support:

import os

def tail(filepath):
    """
    @author Marco Sulla (marcosullaroma@gmail.com)
    @date May 31, 2016
    """

    try:
        filepath.is_file
        fp = str(filepath)
    except AttributeError:
        fp = filepath

    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1

        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)

            if char == b"\n":
                start_pos -= 1
                f.seek(start_pos)

            if start_pos == 0:
                f.seek(start_pos)
            else:
                char = ""

                for pos in range(start_pos, -1, -1):
                    f.seek(pos)

                    char = f.read(1)

                    if char == b"\n":
                        break
                else:
                    # No newline found: the file is a single line, so rewind.
                    f.seek(0)

        return f.readline()

It was inspired by Trasp's answer and AnotherParker's comment.

Answer 6

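First open the file in read mode. Then use the readlines() method to read it line by line. All the lines are stored in a list. Now you can use list slices to get the first and last lines of the file.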

    a=open('file.txt','rb')
    lines = a.readlines()
    if lines:
        first_line = lines[:1]
        last_line = lines[-1]

Answer 7

w=open('file.txt', 'r')
print ('first line is : ',w.readline())
for line in w:  
    x= line
print ('last line is : ',x)
w.close()

The for loop runs through the lines, and x holds the last line after the final iteration.

Answer 8

with open("myfile.txt") as f:
    lines = f.readlines()
    first_row = lines[0]
    print(first_row)
    last_row = lines[-1]
    print(last_row)

Answer 9

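This is an extension of @Trasp's answer that has additional logic for handling the corner case of a file that has only one line. It may be useful to handle this case if you repeatedly want to read the last line of a file that is continuously being updated. Without this, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument is raised.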

def tail(filepath):
    with open(filepath, "rb") as f:
        first = f.readline()      # Read the first line.
        f.seek(-2, 2)             # Jump to the second last byte.
        while f.read(1) != b"\n": # Until EOL is found...
            try:
                f.seek(-2, 1)     # ...jump back the read byte plus one more.
            except IOError:
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
        last = f.readline()       # Read last line.
    return last

Answer 10

Nobody mentioned using reversed:

f=open(file,"r")
r=reversed(f.readlines())
last_line_of_file = next(r)

Answer 11

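Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, os.lseek some amount back from SEEK_END, find the second to last line ending, and then readline() the last line.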
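A minimal sketch of that idea with the low-level os functions follows. The function name last_line_lseek and the max_line_len parameter are assumptions for illustration, and instead of readline() the sketch slices the bytes it has already read.

```python
import os

def last_line_lseek(path, max_line_len=1024):
    """Return the last line as bytes.

    `max_line_len` is an assumed upper bound on the length of one line,
    including its newline. If the bound is too small, the result is truncated.
    """
    # O_BINARY only exists on Windows; fall back to 0 elsewhere.
    fd = os.open(path, os.O_RDONLY | getattr(os, "O_BINARY", 0))
    try:
        size = os.fstat(fd).st_size
        # Back up far enough to cover the last line plus the previous newline.
        window = min(size, max_line_len + 1)
        os.lseek(fd, -window, os.SEEK_END)
        tail = os.read(fd, window)
    finally:
        os.close(fd)
    # Skip the final byte (a possible trailing newline), then cut right
    # after the previous line ending. rfind returns -1 if there is none.
    cut = tail.rfind(b"\n", 0, len(tail) - 1)
    return tail[cut + 1:]
```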

Answer 12

def last_line(filename):  # last_line is an illustrative name for the enclosing function.
    with open(filename, "rb") as f:  # Needs to be in binary mode for the seek from the end to work.
        first = f.readline()
        if f.read(1) == b"":  # Nothing follows the first line, so it is also the last.
            return first
        f.seek(-2, 2)  # Jump to the second last byte.
        while f.read(1) != b"\n":  # Until EOL is found...
            f.seek(-2, 1)  # ...jump back the read byte plus one more.
        last = f.readline()  # Read last line.
        return last

This is a modified version of the answers above that handles the case where the file contains only one line.

12 answers

Answer 1: score 89, posted 2013-09-03 23:29:19
Answer 2: score 66, accepted, posted 2010-07-27 18:06:46
Answer 3: score 25, posted 2010-07-27 18:39:57
Answer 4: score 10, posted 2010-07-27 18:07:27
Answer 5: score 6, posted 2016-05-31 17:01:58
Answer 6: score 4, posted 2013-09-06 04:35:31
Answer 7: score 4, posted 2014-10-29 21:33:20
Answer 8: score 3, posted 2015-01-31 01:40:50
Answer 9: score 2, posted 2017-01-05 17:48:56
Answer 10: score 2, posted 2018-06-20 05:17:30
Answer 11: score 1, posted 2010-07-27 18:08:31
Answer 12: score 1, posted 2018-07-29 08:50:56
