Remove the first two lines of a huge CSV file using Python

Question

I want to remove the header and the first row of a huge CSV file (3 GB) with good performance, using Python.

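import pandas as pd

def remove2rows(csv_file):
    # Read the whole file; the first line is consumed as the header
    data = pd.read_csv(csv_file)
    # Drop the first data row
    data = data.iloc[1:]
    # Write back without the header and without the index column
    data.to_csv(csv_file, header=False, index=False)

if __name__ == "__main__":
    filename = "data.csv"  # placeholder path, not from the original post
    remove2rows(filename)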

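This script works but takes a while, probably because it reads the entire file and then writes every line from line 3 to the end of the file into a new CSV file.

Is there any way to improve the performance?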

Answer 1

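Note that the only way to "remove lines from a file" is to read the whole file (though not necessarily all at once) and write the selected lines back to a new file. That's just how files work.

That said, not using pandas here will certainly save time: pandas is a tool for doing computations on tabular data, not a file utility. Using the stdlib's csv module, or even just plain file operations (if you are 101% sure your CSV contains no embedded newlines), will likely be more efficient, at least with respect to memory usage and probably with respect to raw performance too.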
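As a rough illustration of the csv-module route, here is a minimal sketch that streams the file one record at a time instead of loading it all. The function name `drop_first_rows`, the `.tmp` suffix, the UTF-8 encoding and the `"\n"` line terminator are my own assumptions, not something from the question; adjust them to your file.

```python
import csv
import os


def drop_first_rows(csv_path, n=2):
    """Rewrite csv_path without its first n records, one record at a time."""
    tmp_path = csv_path + ".tmp"  # hypothetical temporary name next to the original
    # newline="" lets the csv module handle line endings, including
    # newlines embedded inside quoted fields.
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(tmp_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        # Assumption: "\n" line endings are wanted (the csv default is "\r\n").
        writer = csv.writer(dst, lineterminator="\n")
        for _ in range(n):
            next(reader, None)  # skip one record; no error if the file is shorter
        writer.writerows(reader)  # stream the remaining records
    # Swap the trimmed file into place only after it was written completely.
    os.replace(tmp_path, csv_path)
```

Calling `drop_first_rows("data.csv")` (path is a placeholder) would drop the header and the first data row. Since every record is parsed and re-serialized, quoting may come out normalized; that is the price for handling embedded newlines safely.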

Answer 2

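Question: Remove the first two lines of a huge CSV file

This example does the following:
Find the offset of the second newline, change the file position to it, and copy from there to the end of the file.

If you get any improved performance, please report back!

Reference:

bytes.find(sub[, start[, end]])
Return the lowest index in the data where the subsequence sub is found.

seek(offset, whence=SEEK_SET)
Change the file position to the given byte offset.

shutil.copyfileobj(fsrc, fdst[, length])
Copies the contents from the current file position to the end of the file.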

import io, shutil

DATA = b"""First line to be skipped
Second line to be skipped
Data Line 1
Data Line 2
Data Line 3
"""

def main():
    # with open('in_filename', 'rb') as in_fh, open('out_filename', 'wb') as out_fh:
    with io.BytesIO(DATA) as in_fh, io.BytesIO() as out_fh:

        # Find the offset of the second newline.
        # Assumes it lies within the first 70 bytes
        # and that the first two lines contain NO embedded newline.
        # Adjust the buffer size to your needs.
        buffer = in_fh.read(70)

        offset = 0
        for _ in range(2):
            index = buffer.find(b'\n', offset)
            if index == -1:
                raise ValueError('second newline not found in the buffer')
            offset = index + 1

        print('Change the file position to: {}'.format(offset))
        in_fh.seek(offset)

        # Copy to the end of the file
        shutil.copyfileobj(in_fh, out_fh)

        # This is only for demo printing the result
        print(out_fh.getvalue())

if __name__ == "__main__":
    main()

Output:

Change the file position to: 51
b'Data Line 1\nData Line 2\nData Line 3\n'

Tested with Python: 3.5
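For real files, the same idea can be sketched without guessing a buffer size: `readline()` in binary mode advances the file position past each newline, and `shutil.copyfileobj` then copies the rest in chunks. The function name `strip_leading_lines`, its parameters and the 1 MiB chunk size are assumptions of mine, and like the demo above it assumes the skipped lines contain no embedded newlines.

```python
import shutil


def strip_leading_lines(in_path, out_path, n=2, chunk_size=1024 * 1024):
    """Copy in_path to out_path, omitting the first n physical lines."""
    with open(in_path, "rb") as in_fh, open(out_path, "wb") as out_fh:
        for _ in range(n):
            # Reads up to and including the next b'\n', moving the file position
            in_fh.readline()
        # Copy everything from the current position to the end of the file,
        # chunk_size bytes at a time, so memory use stays flat.
        shutil.copyfileobj(in_fh, out_fh, chunk_size)
```

Usage would look like `strip_leading_lines("big.csv", "big_trimmed.csv")` (both names are placeholders). The rest of the file is copied as raw bytes and never parsed, which is where any speedup over pandas would come from.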

2 answers

Answer 1
0 2019-12-18 11:16:28

Answer 2
0 2019-12-18 13:59:15
