Most efficient way to convert a large .txt file (size >30GB) to .csv after pre-processing using Python

Question

I have data in a .txt file that looks like this (let's name it "myfile.txt"):

28807644'~'0'~'Maun FCU'~'US#@#@#28855353'~'0'~'WNB Holdings LLC'~'US#@#@#29212330'~'0'~'Idaho First Bank '~'US#@#@#29278777'~'0'~'Republic Bank of Arizona'~'US#@#@#29633181'~'0'~'Friendly Hills Bank'~'US#@#@#29760145'~'0'~'Freedom Bank of Virginia'~'US#@#@#100504846'~'0'~'Community First Fund Federal Credit Union'~'US#@#@#

I tried several ways to convert this .txt into a .csv. One of them used the csv library, but since I like pandas a lot, I went with the following:

import pandas as pd
import time
  
#time at the start of program is noted
start = time.time()

# We set the path where our file is located and read it
path = r'myfile.txt'
f =  open(path, 'r')
content = f.read()
# We replace undesired strings and introduce a breakline.
content_filtered = content.replace("#@#@#", "\n").replace("'", "")
# We read everything in columns with the separator "~" 
df = pd.DataFrame([x.split('~') for x in content_filtered.split('\n')], columns = ['a', 'b', 'c', 'd'])
# We print the dataframe into a csv
df.to_csv(path.replace('.txt', '.csv'), index = None)
end = time.time()
  
#total time taken to print the file
print("Execution time in seconds: ",(end - start))

This takes about 35 seconds to process a 300MB file, and I can accept that kind of performance. But when I try the same thing on a much larger file (35GB), it raises a MemoryError.

I also tried the csv library, with similar results. I put everything into a list and then wrote it out to a CSV:

import csv
# We write to CSV
with open(path.replace('.txt', '.csv'), "w") as outfile:
    write = csv.writer(outfile)
    write.writerows(split_content)

The result was similar, with no big improvement. Is there a way to convert a very large .txt file into a .csv, possibly larger than 35GB?

I'd be happy to read any suggestions you may have. Thanks in advance!

Answer 1

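Since your code just does straight-up replacement, you can read through all the data sequentially and detect the parts that need replacing as you go: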

def process(fn_in, fn_out, columns):
    new_line = b'#@#@#'
    with open(fn_out, 'wb') as f_out:
        # write the header
        f_out.write((','.join(columns)+'\n').encode())
        i = 0
        with open(fn_in, "rb") as f_in:
            while (b := f_in.read(1)):
                if ord(b) == new_line[i]:
                    # keep matching the newline block
                    i += 1
                    if i == len(new_line):
                        # if matched entirely, write just a newline
                        f_out.write(b'\n')
                        i = 0
                    # write nothing while matching
                    continue
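                elif i > 0:
                    # if you reach this, it was a partial match, write it
                    f_out.write(new_line[:i])
                    i = 0
                    if ord(b) == new_line[0]:
                        # the byte that broke the match may start a new one
                        i = 1
                        continue
                if b == b"'":
                    pass
                elif b == b"~":
                    f_out.write(b',')
                else:
                    # write the byte if no match
                    f_out.write(b)
            if i > 0:
                # input ended in the middle of a partial match, write it
                f_out.write(new_line[:i])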


process('my_file.txt', 'out.csv', ['a', 'b', 'c', 'd'])

This does the job pretty quickly. You could improve performance by reading in chunks, but it is quite fast all the same.

The advantage of this approach over yours is that it holds almost nothing in memory, but it does very little to optimise reading the file fast.

Edit: there was a big mistake in an edge case, which I realised after re-reading. It is fixed now.

Answer 2

I took your sample string and made a sample file by multiplying that string by 100 million (something like your_string*1e8 ...) to get a test file that is 31GB.
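For anyone who wants to reproduce a test file like this, here is a minimal sketch that writes the sample repeatedly in batches, so the full 31GB string is never built in memory. The names `make_test_file`, `SAMPLE`, `repeats` and `batch` are my own, not from the answer, and the sample is shortened to two records. Note that the repeat count has to be an int, since multiplying a str by the float `1e8` raises a TypeError.

```python
import os

# Shortened sample (two records); substitute the full string from the question
SAMPLE = "28807644'~'0'~'Maun FCU'~'US#@#@#28855353'~'0'~'WNB Holdings LLC'~'US#@#@#"


def make_test_file(path, sample=SAMPLE, repeats=100_000_000, batch=10_000):
    """Write `sample` to `path` `repeats` times without holding it all in memory."""
    block = sample * batch               # one reusable batch of repeated samples
    full, rest = divmod(repeats, batch)  # whole batches, plus a remainder
    with open(path, "w", encoding="utf-8") as f:
        for _ in range(full):
            f.write(block)
        f.write(sample * rest)
    # Return the resulting file size in bytes
    return os.path.getsize(path)
```

Try it with a small `repeats` first, since the default writes a file of several gigabytes.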

Following @Grismar's suggestion of chunking, I made the following, which processes that 31GB file in ~2 minutes, with peak RAM usage depending on the chunk size.

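The complicated part is keeping track of the field and record separators, which are multiple characters long and will certainly end up spanning a chunk boundary, and thus be truncated.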

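My solution is to inspect the end of each chunk and see if it has a partial separator. If it does, that partial is removed from the end of the current chunk, the current chunk is written out, and the partial becomes the beginning of (and should be completed by) the next chunk: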

CHUNK_SZ = 1024 * 1024

FS = "'~'"
RS = '#@#@#'

# With chars repeated in the separators, check most specific (least ambiguous)
# to least specific (most ambiguous) to definitively catch a partial with the
# fewest number of checks
PARTIAL_RSES = ['#@#@', '#@#', '#@', '#']
PARTIAL_FSES = ["'~", "'"]
ALL_PARTIALS =  PARTIAL_FSES + PARTIAL_RSES 

f_out = open('out.csv', 'w')
f_out.write('a,b,c,d\n')

f_in = open('my_file.txt')
line = ''
while True:
    # Read chunks till no more, then break out
    chunk = f_in.read(CHUNK_SZ)
    if not chunk:
        break

    # Any previous partial separator, plus new chunk
    line += chunk

    # Check end-of-line for a partial FS or RS; only when separators are more than one char
    final_partial = ''

    if line.endswith(FS) or line.endswith(RS):
        pass  # Write-out will replace complete FS or RS
    else:
        for partial in ALL_PARTIALS:
            if line.endswith(partial):
                final_partial = partial
                line = line[:-len(partial)]
                break

    # Process/write chunk
    f_out.write(line
                .replace(FS, ',')
                .replace(RS, '\n'))

    # Add partial back, to be completed next chunk
    line = final_partial


# Write out any partial still held at end-of-file (it was never completed)
f_out.write(line)

# Clean up
f_in.close()
f_out.close()
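One way to gain some confidence in the chunk-boundary handling is to run the same logic on an in-memory string with every possible chunk size and compare the result with a plain whole-string replacement. The sketch below restates the loop above as a generator. `convert_chunks` and `split_every` are hypothetical helper names of mine, the sample is shortened to two records, and the final flush of a leftover partial is my own addition.

```python
FS = "'~'"
RS = "#@#@#"
# Longest partial first within each separator, as in the answer above
PARTIALS = ["'~", "'", "#@#@", "#@#", "#@", "#"]


def convert_chunks(chunks):
    """Apply the partial-separator logic to an iterable of str chunks."""
    line = ""
    for chunk in chunks:
        line += chunk  # previous partial separator (if any) plus the new chunk
        final_partial = ""
        if not (line.endswith(FS) or line.endswith(RS)):
            for partial in PARTIALS:
                if line.endswith(partial):
                    # Hold the partial back so the next chunk can complete it
                    final_partial = partial
                    line = line[:-len(partial)]
                    break
        yield line.replace(FS, ",").replace(RS, "\n")
        line = final_partial
    # Assumption (not in the answer): emit a partial left at end of input
    yield line


def split_every(text, n):
    """Cut text into pieces of at most n characters."""
    return (text[i:i + n] for i in range(0, len(text), n))


sample = "28807644'~'0'~'Maun FCU'~'US#@#@#28855353'~'0'~'WNB Holdings LLC'~'US#@#@#"
expected = sample.replace(FS, ",").replace(RS, "\n")

# Every chunk size, from one character up to the whole string, should agree
for size in range(1, len(sample) + 1):
    assert "".join(convert_chunks(split_every(sample, size))) == expected
```

This only exercises data shaped like the sample. Input where a quote or `#` appears outside a separator would need its own test cases.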

Answer 3

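Just to share an alternative way, based on convtools (table docs | github). This solution is faster than the OP's, but ~7 times slower than Zach's (Zach works with str chunks, while this one works with row tuples, reading via csv.reader).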

Still, this approach may be useful, because it lets you tap into the processing and work with the columns: rearrange them, add new ones, and so on.

from convtools import conversion as c
from convtools.contrib.fs import split_buffer
from convtools.contrib.tables import Table

def get_rows(filename):
    with open(filename, "r") as f:
        for row in split_buffer(f, "#@#@#"):
            yield row.replace("'", "")

Table.from_csv(
    get_rows("tmp.csv"), dialect=Table.csv_dialect(delimiter="~")
).into_csv("tmp_out.csv", include_header=False)
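For readers who would rather avoid an extra dependency, the same row-by-row idea can be sketched with only the standard library. This is not the convtools implementation, just a rough equivalent under my own assumptions: `iter_records` and `convert` are hypothetical names, and the column names are taken from the question. Because rows go through `csv.writer`, a field that contains a comma gets quoted in the output.

```python
import csv


def iter_records(f, sep="#@#@#", chunk_size=1024 * 1024):
    """Yield records from text file object f, split on the multi-character sep."""
    buf = ""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        parts = buf.split(sep)
        # The last piece may be an incomplete record or half a separator,
        # so keep it in the buffer until more data arrives
        buf = parts.pop()
        yield from parts
    if buf:
        # Trailing data that was not terminated by a separator
        yield buf


def convert(f_in, f_out, columns=("a", "b", "c", "d"), chunk_size=1024 * 1024):
    """Stream records from f_in and write them as CSV rows to f_out."""
    writer = csv.writer(f_out)
    writer.writerow(columns)
    for record in iter_records(f_in, chunk_size=chunk_size):
        # Drop the quote characters, then split the fields on the remaining "~"
        writer.writerow(record.replace("'", "").split("~"))


# Usage with real files (open the output with newline="" for the csv module):
# with open("my_file.txt") as f_in, open("out.csv", "w", newline="") as f_out:
#     convert(f_in, f_out)
```

Expect this to be slower than the plain string replacement in the chunked answer, since every record is split into a list and re-joined by the writer.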

Problem description

3 solutions

Solution 1
2 2021-12-02 23:29:12

Solution 2
2 Accepted 2021-12-03 05:01:27

Solution 3
1 2022-07-16 20:43:55
