
How to loop over a large Parquet file with generators in Python?

Is it possible to open Parquet files and iterate through them line by line using generators? The goal is to avoid loading the whole Parquet file into memory.

You cannot iterate line by line, because that is not how the data is stored on disk. Parquet files are organized into row groups, so you can iterate through the row groups instead:

from fastparquet import ParquetFile

pf = ParquetFile('myfile.parq')
for df in pf.iter_row_groups():
    # df is a pandas DataFrame holding one row group
    process(df)  # replace with your own processing
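
If you specifically want a row-level generator, as the question asks, you can wrap the row-group iterator in a generator function. The sketch below is a minimal example built on the same fastparquet call; iter_rows is just an illustrative helper name, and each yielded row is a namedtuple of column values:

from fastparquet import ParquetFile

def iter_rows(path):
    # Only one row group's DataFrame is held in memory at a time
    pf = ParquetFile(path)
    for df in pf.iter_row_groups():
        yield from df.itertuples(index=False)

for row in iter_rows('myfile.parq'):
    print(row)  # namedtuple with one value per column
    break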

You can iterate using tensorflow_io.

import tensorflow_io as tfio

dataset = tfio.IODataset.from_parquet('myfile.parquet')

for line in dataset.take(3):
    # print the first 3 lines
    print(line)
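
Since the returned dataset is a standard tf.data dataset, you can also chain transformations such as .batch() or .prefetch() before iterating if you prefer to consume rows in chunks rather than one at a time.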

If, as is often the case, the Parquet data is stored as multiple files in one directory, you can loop over the files and process them one at a time:

import glob
import pandas as pd

for parquet_file in glob.glob(parquet_dir + "/*.parquet"):
    df = pd.read_parquet(parquet_file)
    for value1, value2, value3 in zip(df['col1'], df['col2'], df['col3']):
        pass  # process one row at a time
    del df  # free the DataFrame before loading the next file
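
The same pattern can be wrapped in a generator so that callers never hold more than one file's DataFrame in memory; iter_directory_rows is just an illustrative name, assuming the same directory layout as above:

import glob
import pandas as pd

def iter_directory_rows(parquet_dir):
    # Yield rows one by one; only one file is loaded into memory at a time
    for parquet_file in sorted(glob.glob(parquet_dir + "/*.parquet")):
        df = pd.read_parquet(parquet_file)
        yield from df.itertuples(index=False)
        del df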
