
How would I go about converting a .csv to an .arrow file without loading it all into memory?

I found a similar question here: Read CSV with PyArrow

That answer references sys.stdin.buffer and sys.stdout.buffer, but I am not sure how I would use that to write the .arrow file, or name it. I can't seem to find the exact information I am looking for in the pyarrow docs. My file will not have any nans, but it will have a timestamped index. The file is ~100 GB, so loading it into memory is not an option. I tried changing the code, but, as I suspected, the code ends up overwriting the previous file on every loop.

***This is my first post. I would like to thank all the contributors who answered 99.9% of my other questions before I had even asked them.

import sys

import pandas as pd
import pyarrow as pa

SPLIT_ROWS = 1     ### used one line chunks for a small test

def main():
    writer = None
    for split in pd.read_csv(sys.stdin.buffer, chunksize=SPLIT_ROWS):

        table = pa.Table.from_pandas(split)
        # Write out to file
        with pa.OSFile('test.arrow', 'wb') as sink:     ### no append mode yet
            with pa.RecordBatchFileWriter(sink, table.schema) as writer:
                writer.write_table(table)
    writer.close()

if __name__ == "__main__":
    main()

Below is what I used on the command line:

>cat data.csv | python test.py

正如@Pace 所建議的,您應該考慮將輸出文件的創建移到讀取循環之外。 像這樣的東西:

import sys

import pandas as pd
import pyarrow as pa

SPLIT_ROWS = 1     ### used one line chunks for a small test

def main():
    # Write out to file
    with pa.OSFile('test.arrow', 'wb') as sink:     ### no append mode yet
        reader = pd.read_csv('data.csv', chunksize=SPLIT_ROWS)
        first_table = pa.Table.from_pandas(next(reader))  # peek at the first chunk to get the schema
        with pa.RecordBatchFileWriter(sink, first_table.schema) as writer:
            writer.write_table(first_table)
            for split in reader:
                table = pa.Table.from_pandas(split)
                writer.write_table(table)

if __name__ == "__main__":
    main()        

You also don't have to use sys.stdin.buffer if you would rather specify particular input and output files. You could then run the script as:

python test.py

Solution adapted from @Martin-Evans' code:

The file is closed after the for loop, as suggested by @Pace

import sys

import pandas as pd
import pyarrow as pa

SPLIT_ROWS = 1000000

def main():
    schema = pa.Table.from_pandas(pd.read_csv('Data.csv',nrows=2)).schema 
    ### reads first two lines to define schema 

    with pa.OSFile('test.arrow', 'wb') as sink:
        with pa.RecordBatchFileWriter(sink, schema) as writer:            
            for split in pd.read_csv('Data.csv',chunksize=SPLIT_ROWS):
                table = pa.Table.from_pandas(split)
                writer.write_table(table)

            writer.close()

if __name__ == "__main__":
    main()   


Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license. If you need to repost, please cite this site's URL or the original source. For any questions, contact: yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM