
What is the least memory-intensive way to read a Parquet file in Python? Is line-by-line possible?

I'm writing an AWS Lambda function to read records stored in Parquet files, restructure them into a partition_key: {json_record} format, and submit the records to a Kafka queue. I'm wondering whether there's any way to do this without reading the entire table into memory at once.

I've tried using the iter_row_groups method from the fastparquet library (roughly the attempt sketched below), but my files contain only a single row group, so I'm still loading the entire table into memory. I also noticed that pyarrow's BufferReader has a readlines method, but it isn't implemented. Is true line-by-line reading of Parquet simply not possible?
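For reference, the fastparquet attempt looks roughly like this (a minimal sketch; the file name is illustrative):

```python
import fastparquet

# Open the Parquet file (file name is illustrative).
pf = fastparquet.ParquetFile("records.parquet")

# iter_row_groups() yields one pandas DataFrame per row group;
# with a single row group, the first chunk is already the whole table.
for chunk in pf.iter_row_groups():
    print(len(chunk))
```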

It might be worth pointing out that I'm working with Parquet files stored in S3, so ideally a solution would be able to read from a StreamingBody.
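Right now the naive route is to buffer the whole object in memory before handing it to a Parquet reader, which is exactly the cost I'd like to avoid. A rough sketch of that, with placeholder bucket and key names:

```python
import io

import boto3
import pyarrow.parquet as pq

s3 = boto3.client("s3")

# get_object returns the payload as a botocore StreamingBody.
body = s3.get_object(Bucket="my-bucket", Key="records.parquet")["Body"]

# Parquet needs random access (the metadata lives in the footer),
# so this simple approach buffers the entire object before parsing it.
table = pq.read_table(io.BytesIO(body.read()))
print(table.num_rows)
```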

I suggest looking into DuckDB and Polars:

With DuckDB you can certainly limit a query to, say, the top 1000 results. And if you have a row index column, iterating through the whole Parquet file with SELECT ... WHERE on that index is straightforward, for example as sketched below.
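A rough sketch of that idea in Python, assuming the file already contains a dense, 0-based row index column named row_idx (the file and column names are placeholders):

```python
import duckdb

CHUNK = 1000
offset = 0
while True:
    # read_parquet is DuckDB's Parquet table function; only the rows that
    # satisfy the predicate are materialized on the Python side.
    rows = duckdb.execute(
        "SELECT * FROM read_parquet('records.parquet') "
        "WHERE row_idx >= ? AND row_idx < ?",
        [offset, offset + CHUNK],
    ).fetchall()
    if not rows:
        break
    # ... restructure each row and send it to Kafka here ...
    offset += CHUNK
```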

With Polars you can experiment with the row_count_name and row_count_offset parameters. Again, given an existing row index column, reading rows in chunks is doable; see the sketch below.
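A sketch of a chunked read with Polars using those parameters (recent Polars releases rename them to row_index_name / row_index_offset; the file name is again a placeholder):

```python
import polars as pl

CHUNK = 1000
offset = 0
while True:
    # scan_parquet is lazy; row_count_name attaches a row index column.
    # (Newer Polars versions call these row_index_name / row_index_offset.)
    chunk = (
        pl.scan_parquet("records.parquet", row_count_name="row_idx")
        .filter(
            (pl.col("row_idx") >= offset) & (pl.col("row_idx") < offset + CHUNK)
        )
        .collect()
    )
    if chunk.height == 0:
        break
    # ... restructure chunk.to_dicts() and send the records to Kafka here ...
    offset += CHUNK
```

Only the collected chunk is held as a DataFrame in Python at any one time, so the working set stays bounded by the chunk size.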


 