How to read a large text file without reading it line by line :: Python

Question

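I have a large data file of shape (N, 4) that I am mapping line by line. The file is 10 GB, and a simple implementation is given below. It works, but it takes a huge amount of time.

I would like to implement this logic so that the text file is read directly and I can access its elements. After that, I need to sort the whole (mapped) file by the elements of column 2.

The examples I have seen online assume a small piece of data (d) and use f[:] = d[:], but I can't do that because d is huge in my case and eats up my RAM.

PS: I know how to load the file with np.loadtxt and sort it with argsort, but that logic fails (memory error) for files in the GB range. I would appreciate any direction.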

import numpy as np

nrows, ncols = 20000000, 4  # the real nrows is much larger; this is just for illustration
f = np.memmap('memmapped.dat', dtype=np.float32,
              mode='w+', shape=(nrows, ncols))

filename = "my_file.txt"

with open(filename) as file:
    # parse each comma-separated line and store it as one row of the memmap
    for i, line in enumerate(file):
        floats = [float(x) for x in line.split(',')]
        f[i, :] = floats
del f  # deleting the memmap flushes pending changes to disk

Answer 1

Edit: rather than do-it-yourself chunking, it is better to use the chunking feature of pandas, which is much faster than numpy's loadtxt.

import numpy as np
import pandas as pd

## create csv file for testing
np.random.seed(1)
nrows, ncols = 100000, 4
data = np.random.uniform(size=(nrows, ncols))
np.savetxt('bigdata.csv', data, delimiter=',')

## read it back
chunk_rows = 12345
# Replace np.empty by np.memmap array for large datasets.
odata = np.empty((nrows, ncols), dtype=np.float32)
oindex = 0
chunks = pd.read_csv('bigdata.csv', chunksize=chunk_rows, 
                     names=['a', 'b', 'c', 'd'])
for chunk in chunks:
    m, _ = chunk.shape
    odata[oindex:oindex+m, :] = chunk
    oindex += m

# check that it worked correctly.
assert np.allclose(data, odata, atol=1e-7)

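In chunked mode, the pd.read_csv function returns a special object that can be used in a loop such as for chunk in chunks:. On every iteration it reads a part of the file and returns its contents as a pandas DataFrame, which in this case can be treated as a numpy array. The names parameter is needed to prevent it from treating the first line of the csv file as column names.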
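The comment in the code above suggests replacing np.empty with an np.memmap array for large datasets. A minimal sketch of that variant follows. The helper name csv_to_memmap, the file names, and the sizes are made up for illustration, and the sketch assumes the number of rows is known in advance and the file has no header line.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def csv_to_memmap(csv_path, dat_path, nrows, ncols, chunk_rows=100_000):
    """Copy a headerless numeric CSV into a float32 memmap, chunk by chunk.

    Hypothetical helper: nrows must be known (or counted) beforehand.
    """
    out = np.memmap(dat_path, dtype=np.float32, mode='w+',
                    shape=(nrows, ncols))
    start = 0
    # header=None: the first line is data, not column names.
    for chunk in pd.read_csv(csv_path, header=None, chunksize=chunk_rows,
                             dtype=np.float32):
        stop = start + len(chunk)
        out[start:stop, :] = chunk.to_numpy()
        start = stop
    out.flush()  # make sure everything is written to disk
    return out


# Small self-check on a temporary file (sizes are illustrative only).
tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, 'demo.csv')
dat_path = os.path.join(tmpdir, 'demo.dat')
demo = np.random.default_rng(1).uniform(size=(1000, 4))
np.savetxt(csv_path, demo, delimiter=',')
mm = csv_to_memmap(csv_path, dat_path, nrows=1000, ncols=4, chunk_rows=123)
```

Only one chunk of rows is held in RAM at a time; the full array lives in the file behind the memmap.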

Old answer below

The numpy.loadtxt function works with a filename or with anything that returns lines when looped over in a construct such as:

for line in f: 
   do_something()

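It doesn't even need to pretend to be a file; a list of strings will do!

We can read chunks of the file that are small enough to fit in memory and feed batches of lines to np.loadtxt.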
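As a quick illustration of the list-of-strings case (the two sample lines are made up):

```python
import numpy as np

# Two lines of comma-separated text, as they might come from a file chunk.
lines = ['1.0,2.0,3.0,4.0', '5.0,6.0,7.0,8.0']
arr = np.loadtxt(lines, delimiter=',')
print(arr.shape)  # (2, 4)
```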

def get_file_lines(fname, seek, maxlen):
    """Read lines from a section of a file.
    
    Parameters:
        
    - fname: filename
    - seek: start position in the file
    - maxlen: maximum length (bytes) to read
    
    Return:
        
    - lines: list of lines (only entire lines).
    - seek_end: seek position at end of this chunk.
    
    Reference: https://stackoverflow.com/a/63043614/6228891
    Copying: any of CC-BY-SA, CC-BY, GPL, BSD, LPGL
    Author: Han-Kwang Nienhuys
    """
    # binary mode, so Windows \r\n line endings don't affect byte offsets
    with open(fname, 'rb') as f:
        f.seek(seek)
        buf = f.read(maxlen)
    n = len(buf)
    if n == 0:
        return [], seek

    # find a newline near the end; i counts back from the last byte (i=1)
    for i in range(1, min(10000, n) + 1):
        if buf[-i] == 0x0a:
            # newline found: keep everything up to and including it
            buflen = n - i + 1
            lines = buf[:buflen].decode('utf-8').split('\n')
            seek_end = seek + buflen
            return lines, seek_end
    else:
        raise ValueError('Could not find end of line')

import numpy as np

## create csv file for testing
np.random.seed(1)
nrows, ncols = 10000, 4

data = np.random.uniform(size=(nrows, ncols))
np.savetxt('bigdata.csv', data, delimiter=',')

# read it back        
fpos = 0
chunksize = 456 # Small value for testing; make this big (megabytes).

# we will store the data here. Replace by memmap array if necessary.
odata = np.empty((nrows, ncols), dtype=np.float32)
oindex = 0

while True:
    lines, fpos = get_file_lines('bigdata.csv', fpos, chunksize)
    if not lines:
        # end of file
        break
    rdata = np.loadtxt(lines, delimiter=',')
    m, _ = rdata.shape
    odata[oindex:oindex+m, :] = rdata
    oindex += m
    
assert np.allclose(data, odata, atol=1e-7)

Disclaimer: I tested this on Linux. I expect it to work on Windows, but the handling of '\r' characters might cause problems.

Answer 2

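I realize this is not an answer, but have you considered using binary files? When files are very large, saving in ASCII is very inefficient. If you can, use np.save and np.load instead.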
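A minimal sketch of the binary route, under a few assumptions: the data is already available as a numpy array, the file name and sizes are made up, and mmap_mode is an addition here (the answer itself only mentions np.save and np.load). It also shows one possible way to get the column-2 ordering the question asks about while only reading a single column into memory.

```python
import os
import tempfile

import numpy as np

# Illustrative data; in practice this would be the converted text file.
path = os.path.join(tempfile.mkdtemp(), 'demo.npy')
data = np.random.default_rng(0).uniform(size=(1000, 4)).astype(np.float32)
np.save(path, data)

# mmap_mode='r' maps the file instead of loading it fully into RAM.
arr = np.load(path, mmap_mode='r')

# Sort order by the second column (index 1); only that column is read.
order = np.argsort(arr[:, 1], kind='stable')

# Fancy indexing copies just the requested rows into memory.
sorted_head = arr[order[:5]]
```

The index array still needs one integer per row, so for very large N this step has its own memory cost.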

2 solutions

Solution 1: score 2, accepted, 2020-07-22 21:40:22

Solution 2: score -1, 2020-07-22 22:06:14
