How would I go about converting a .csv to an .arrow file without loading it all into memory?
I found a similar question here: Read CSV with PyArrow
In that answer it references sys.stdin.buffer and sys.stdout.buffer, but I am not exactly sure how that would be used to write the .arrow file, or how to name it. I can't seem to find the exact information I am looking for in the pyarrow docs. My file will not have any nans, but it will have a timestamped index. The file is ~100 GB, so loading it all into memory is not an option. I tried changing the code, but as I suspected, the code ends up overwriting the previous file on every loop.
***This is my first post. I would like to thank all the contributors who answered 99.9% of my other questions before I had even asked them.
import sys

import pandas as pd
import pyarrow as pa

SPLIT_ROWS = 1  ### used one-line chunks for a small test

def main():
    writer = None
    for split in pd.read_csv(sys.stdin.buffer, chunksize=SPLIT_ROWS):
        table = pa.Table.from_pandas(split)
        # Write out to file -- reopening in 'wb' mode overwrites the previous chunk
        with pa.OSFile('test.arrow', 'wb') as sink:  ### no append mode yet
            with pa.RecordBatchFileWriter(sink, table.schema) as writer:
                writer.write_table(table)
    writer.close()

if __name__ == "__main__":
    main()
Below is the command I used on the command line:

>cat data.csv | python test.py
As @Pace suggested, you should consider moving the creation of the output file outside of the reading loop. Something like this:
import pandas as pd
import pyarrow as pa

SPLIT_ROWS = 1  ### used one-line chunks for a small test

def main():
    # Write out to file
    with pa.OSFile('test.arrow', 'wb') as sink:  ### no append mode yet
        chunks = pd.read_csv('data.csv', chunksize=SPLIT_ROWS)
        # the writer needs the schema up front, so take it from the first chunk
        table = pa.Table.from_pandas(next(chunks))
        with pa.RecordBatchFileWriter(sink, table.schema) as writer:
            writer.write_table(table)
            for split in chunks:
                writer.write_table(pa.Table.from_pandas(split))

if __name__ == "__main__":
    main()
You also don't have to use sys.stdin.buffer if you would rather specify explicit input and output files. You could then run the script as:
python test.py
Solution adapted from @Martin-Evans' code:
Closed the file after the for loop, as @Pace suggested.
import pandas as pd
import pyarrow as pa

SPLIT_ROWS = 1000000

def main():
    ### reads the first two lines to define the schema
    schema = pa.Table.from_pandas(pd.read_csv('Data.csv', nrows=2)).schema
    with pa.OSFile('test.arrow', 'wb') as sink:
        with pa.RecordBatchFileWriter(sink, schema) as writer:
            for split in pd.read_csv('Data.csv', chunksize=SPLIT_ROWS):
                table = pa.Table.from_pandas(split)
                writer.write_table(table)
            writer.close()

if __name__ == "__main__":
    main()