
How would I go about converting a .csv to an .arrow file without loading it all into memory?

I found a similar question here: Read CSV with PyArrow

That answer references sys.stdin.buffer and sys.stdout.buffer, but I am not sure how they would be used to write the .arrow file, or to name it. I can't find the information I am looking for in the pyarrow docs. My file will not have any NaNs, but it will have a timestamp index. The file is ~100 GB, so loading it into memory simply isn't an option. I tried changing the code, but as I expected, it ended up overwriting the previous file on every loop iteration.

This is my first post. I would like to thank all the contributors who answered 99.9% of my other questions before I had even asked them.

import sys

import pandas as pd
import pyarrow as pa

SPLIT_ROWS = 1     ### used one line chunks for a small test

def main():
    writer = None
    for split in pd.read_csv(sys.stdin.buffer, chunksize=SPLIT_ROWS):

        table = pa.Table.from_pandas(split)
        # Write out to file
        with pa.OSFile('test.arrow', 'wb') as sink:     ### no append mode yet
            with pa.RecordBatchFileWriter(sink, table.schema) as writer:
                writer.write_table(table)
    writer.close()

if __name__ == "__main__":
    main()

Below is the command I used on the command line:

>cat data.csv | python test.py

As suggested by @Pace, you should consider moving the output file creation outside of the reading loop. Something like this:

import sys

import pandas as pd
import pyarrow as pa

SPLIT_ROWS = 1     ### used one line chunks for a small test

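def main():
    # The writer needs a schema before the loop starts, so infer it from the first rows
    schema = pa.Table.from_pandas(pd.read_csv('data.csv', nrows=2)).schema
    # Write out to file
    with pa.OSFile('test.arrow', 'wb') as sink:     ### no append mode yet
        with pa.RecordBatchFileWriter(sink, schema) as writer:
            for split in pd.read_csv('data.csv', chunksize=SPLIT_ROWS):
                table = pa.Table.from_pandas(split)
                writer.write_table(table)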

if __name__ == "__main__":
    main()        

You also don't have to use sys.stdin.buffer if you would prefer to name the input and output files in the script itself, as the code above does. You could then just run the script as:

python test.py

Solution adapted from @Martin-Evans' code:

Closed the file after the for loop, as suggested by @Pace.

import sys

import pandas as pd
import pyarrow as pa

SPLIT_ROWS = 1000000

def main():
    # Read only the first two rows to define the schema
    schema = pa.Table.from_pandas(pd.read_csv('Data.csv', nrows=2)).schema

    with pa.OSFile('test.arrow', 'wb') as sink:
        with pa.RecordBatchFileWriter(sink, schema) as writer:            
            for split in pd.read_csv('Data.csv',chunksize=SPLIT_ROWS):
                table = pa.Table.from_pandas(split)
                writer.write_table(table)

            writer.close()

if __name__ == "__main__":
    main()   
