Efficient way to read a large txt file in Python

Question

I am trying to open a txt file with 4605227 lines (305 MB).

The way I was doing this before is:

data = np.loadtxt('file.txt', delimiter='\t', dtype=str, skiprows=1)

df = pd.DataFrame(data, columns=["a", "b", "c", "d", "e", "f", "g", "h", "i"])

df = df.astype(dtype={"a": "int64", "h": "int64", "i": "int64"})

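But it used up most of the available memory (~10 GB) and never finished. Is there a faster way to read this txt file and create a pandas DataFrame?

Thanks!

Edit: Solved now, thank you. Why is np.loadtxt() so slow?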

Answer 1

Instead of reading it with numpy, read it directly into a pandas DataFrame. For example, use the pandas.read_csv function, something like:

df = pd.read_csv('file.txt', delimiter='\t', usecols=["a", "b", "c", "d", "e", "f", "g", "h", "i"])
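As a minimal sketch of the same idea, the example below reads tab-separated text with pandas.read_csv and declares the integer columns up front through the dtype parameter, so no separate astype pass is needed. The in-memory sample and the assumption that the header row holds the names a through i are illustrative only; they are not taken from the asker's actual file.

```python
import io

import pandas as pd

# Tiny stand-in for file.txt (assumed layout: header row, tab-separated, 9 columns).
sample = io.StringIO(
    "a\tb\tc\td\te\tf\tg\th\ti\n"
    "1\tx\ty\tz\tu\tv\tw\t10\t100\n"
    "2\tx\ty\tz\tu\tv\tw\t20\t200\n"
)

# Columns missing from the dtype mapping are left for pandas to infer.
df = pd.read_csv(
    sample,
    sep="\t",
    dtype={"a": "int64", "h": "int64", "i": "int64"},
)
print(df.dtypes)
```

For a real file, pass the path (for example 'file.txt') in place of the sample object. If memory is still tight, read_csv also accepts a chunksize argument, which makes it return an iterator of smaller DataFrames instead of one large one.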

Answer 2

Method 1:

You can read the file in chunks. readlines accepts a size hint, so each call returns only about that much data instead of the whole file.

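BUFFERSIZE = 2 ** 20  # size hint per call; assumed value, tune as needed
with open('inputTextFile', 'r') as inputFile:
    buffer_lines = inputFile.readlines(BUFFERSIZE)
    while buffer_lines:
        for line in buffer_lines:
            pass  # logic goes here
        buffer_lines = inputFile.readlines(BUFFERSIZE)  # fetch the next chunk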
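One way to package the chunked read described in Method 1 is a small generator, sketched below. The helper name iter_line_batches and the tiny size hint are hypothetical choices for illustration; the only library behaviour relied on is that readlines(hint) stops reading once the lines collected so far reach roughly hint in total size, and returns an empty list at end of file.

```python
import io


def iter_line_batches(fileobj, size_hint):
    """Yield lists of complete lines, each holding roughly size_hint worth of text."""
    while True:
        batch = fileobj.readlines(size_hint)
        if not batch:  # an empty list means end of file
            break
        yield batch


# Stand-in for a large file: ten short lines held in memory.
sample = io.StringIO("".join(f"row {n}\n" for n in range(10)))

batches = list(iter_line_batches(sample, 16))
batch_sizes = [len(batch) for batch in batches]
total = sum(batch_sizes)
print(batch_sizes, total)
```

With a real file, open it with open(...) and pass the file object instead of the StringIO sample. Because every batch contains only whole lines, each one can be parsed independently.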

Method 2:

You can also use the mmap module; the link below the example explains its usage.

import mmap

with open("hello.txt", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello  world!\n"
    # close the map
    mm.close()

https://docs.python.org/3/library/mmap.html
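The documentation example above opens the file for writing. For simply reading a large file, a read-only mapping is enough; the sketch below counts lines that way. The temporary file and its contents are made up so the example is self-contained, and are not part of the original answer.

```python
import mmap
import os
import tempfile

# Create a small throwaway file so the sketch can run anywhere.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"a\tb\n1\t2\n3\t4\n")
    path = tmp.name

line_count = 0
with open(path, "rb") as f:
    # ACCESS_READ maps the file read-only; a length of 0 maps the whole file.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # readline() returns b"" once the end of the map is reached.
        for line in iter(mm.readline, b""):
            line_count += 1

os.remove(path)
print(line_count)
```

Lines come back as bytes, so decode them (for example with line.decode("utf-8")) before treating them as text.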

Answer 3

You can read it directly into a pandas DataFrame. For example:

import pandas as pd
pd.read_csv(path)

If you want faster reads, you can use modin:

import modin.pandas as pd
pd.read_csv(path)

https://github.com/modin-project/modin

Answer 4

The code below reads the file line by line: it iterates over each line of the file object in a for loop and processes the lines as needed.

with open("file.txt") as fobj:
    for line in fobj:
        print(line)  # do your processing here
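To make the "process each line" step concrete, here is a minimal sketch that skips the header, splits each line on tabs, and converts some fields to int as it goes. The helper name parse_rows, the column positions 0, 7 and 8 (standing for columns a, h and i from the question), and the sample data are assumptions for illustration.

```python
import io


def parse_rows(fileobj, int_columns=(0, 7, 8), sep="\t"):
    """Yield one list per data line, converting the given positions to int."""
    next(fileobj, None)  # skip the header row, like skiprows=1 in the question
    for line in fileobj:
        fields = line.rstrip("\n").split(sep)
        for idx in int_columns:
            fields[idx] = int(fields[idx])
        yield fields


# Stand-in for file.txt: a header plus two tab-separated rows.
sample = io.StringIO(
    "a\tb\tc\td\te\tf\tg\th\ti\n"
    "1\tx\ty\tz\tu\tv\tw\t10\t100\n"
    "2\tx\ty\tz\tu\tv\tw\t20\t200\n"
)

rows = list(parse_rows(sample))
print(rows)
```

Because parse_rows is a generator, only one line is held in memory at a time unless the caller collects the results, as list(...) does here for demonstration.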


Answer 1: score 1, accepted, 2019-11-14 15:29:02
Answer 2: score 0, 2019-11-14 15:41:57
Answer 3: score 0, 2019-11-14 17:08:38
Answer 4: score -2, 2019-11-14 15:31:14
