
What is the least memory-intensive way to read a Parquet file in Python? Is line-by-line possible?

I'm writing an AWS Lambda function to read records stored in Parquet files, restructure them into a partition_key: {json_record} format, and submit the records to a Kafka queue. I'm wondering whether there's any way to do this without reading the entire table into memory at once.

I've tried using the iter_row_groups method from the fastparquet library (roughly the attempt sketched below), but my files contain only a single row group, so I'm still loading the entire table into memory. I also noticed that pyarrow's BufferReader has a readlines method, but it isn't implemented. Is true line-by-line reading of Parquet simply not possible?
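For reference, the fastparquet attempt looks roughly like this (a minimal sketch; the file name is illustrative):

```python
import fastparquet

# Open the Parquet file (file name is illustrative).
pf = fastparquet.ParquetFile("records.parquet")

# iter_row_groups() yields one pandas DataFrame per row group;
# with a single row group, the first chunk is already the whole table.
for chunk in pf.iter_row_groups():
    print(len(chunk))
```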

It might be worth pointing out that I'm working with Parquet files stored in S3, so ideally a solution would be able to read from a StreamingBody.
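Right now the naive route is to buffer the whole object in memory before handing it to a Parquet reader, which is exactly the cost I'd like to avoid. A rough sketch of that, with placeholder bucket and key names:

```python
import io

import boto3
import pyarrow.parquet as pq

s3 = boto3.client("s3")

# get_object returns the payload as a botocore StreamingBody.
body = s3.get_object(Bucket="my-bucket", Key="records.parquet")["Body"]

# Parquet needs random access (the metadata lives in the footer),
# so this simple approach buffers the entire object before parsing it.
table = pq.read_table(io.BytesIO(body.read()))
print(table.num_rows)
```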

I suggest looking into DuckDB and Polars:

With DuckDB you can certainly limit a query to, say, the top 1000 results. And if you have a row index column, iterating through the whole Parquet file with SELECT ... WHERE on that index is straightforward, for example as sketched below.
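A rough sketch of that idea in Python, assuming the file already contains a dense, 0-based row index column named row_idx (the file and column names are placeholders):

```python
import duckdb

CHUNK = 1000
offset = 0
while True:
    # read_parquet is DuckDB's Parquet table function; only the rows that
    # satisfy the predicate are materialized on the Python side.
    rows = duckdb.execute(
        "SELECT * FROM read_parquet('records.parquet') "
        "WHERE row_idx >= ? AND row_idx < ?",
        [offset, offset + CHUNK],
    ).fetchall()
    if not rows:
        break
    # ... restructure each row and send it to Kafka here ...
    offset += CHUNK
```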

With Polars you can experiment with the row_count_name and row_count_offset parameters. Again, given an existing row index column, reading rows in chunks is doable; see the sketch below.
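A sketch of a chunked read with Polars using those parameters (recent Polars releases rename them to row_index_name / row_index_offset; the file name is again a placeholder):

```python
import polars as pl

CHUNK = 1000
offset = 0
while True:
    # scan_parquet is lazy; row_count_name attaches a row index column.
    # (Newer Polars versions call these row_index_name / row_index_offset.)
    chunk = (
        pl.scan_parquet("records.parquet", row_count_name="row_idx")
        .filter(
            (pl.col("row_idx") >= offset) & (pl.col("row_idx") < offset + CHUNK)
        )
        .collect()
    )
    if chunk.height == 0:
        break
    # ... restructure chunk.to_dicts() and send the records to Kafka here ...
    offset += CHUNK
```

Only the collected chunk is held as a DataFrame in Python at any one time, so the working set stays bounded by the chunk size.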


 