
Divide a parquet file into 3 parquet files using Python

Is there a way to divide a huge parquet file into smaller ones (using Python), keeping all the columns and just splitting the rows? Thank you.

You can do it with Dask:

import dask.dataframe as dd

# Read the source file and rewrite it as 3 partitions, one output file per partition
ddf = dd.read_parquet('my_file.parquet')
ddf.repartition(npartitions=3).to_parquet('my_files/')

Edit: you need to install either fastparquet or pyarrow.
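
If both libraries are installed and you want to be explicit about which one Dask uses, you can pass the engine argument. A minimal sketch, assuming pyarrow is the engine you want:

import dask.dataframe as dd

# Explicitly select the pyarrow engine for reading and writing
ddf = dd.read_parquet('my_file.parquet', engine='pyarrow')
ddf.repartition(npartitions=3).to_parquet('my_files/', engine='pyarrow')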

As the answer with Dask works only when the file fits into your computer's RAM, I'm going to share a script that uses Pyarrow and reads the file chunk by chunk (row group by row group):

import os
import pyarrow as pa
import pyarrow.parquet as pq


class ParquetSplitter:

    def __init__(self,
                 src_parquet_path: str,
                 target_dir: str,
                 num_chunks: int = 25
                 ):
        self._src_parquet_path = src_parquet_path
        self._target_dir = target_dir
        self._num_chunks = num_chunks

        self._src_parquet = pq.ParquetFile(
            self._src_parquet_path,
            memory_map=True,
        )

        self._total_group_num = self._src_parquet.num_row_groups
        self._schema = self._src_parquet.schema

    @property
    def num_row_groups(self):
        print(f'Total num of groups found: {self._total_group_num}')
        return self._src_parquet.num_row_groups

    @property
    def schema(self):
        return self._schema

    def read_rows(self):
        # Helper for inspecting the data; 'player_id' and 'played_at' are
        # column names from the original dataset -- replace them with your own.
        for elem in self._src_parquet.iter_batches(
                columns=['player_id', 'played_at']):
            elem: pa.RecordBatch
            print(elem.to_pydict())

    def split(self):
        for chunk_num, chunk_range in self._next_chunk_range():
            table = self._src_parquet.read_row_groups(row_groups=chunk_range)
            file_name = f'chunk_{chunk_num}.parquet'
            path = os.path.join(self._target_dir, file_name)
            print(f'Writing chunk #{chunk_num}...')
            pq.write_table(
                table=table,
                where=path,
            )

    def _next_chunk_range(self):
        upper_bound = self.num_row_groups

        # Guard against num_chunks > num_row_groups, which would make the
        # step zero and loop forever
        chunk_size = max(1, upper_bound // self._num_chunks)

        chunk_num = 0
        low, high = 0, chunk_size
        while low < upper_bound:
            group_range = list(range(low, high))

            yield chunk_num, group_range
            chunk_num += 1
            low, high = low + chunk_size, high + chunk_size
            if high > upper_bound:
                high = upper_bound

    @staticmethod
    def _get_row_hour(row: pa.RecordBatch):
        # Leftover helper from the original dataset; not used by split()
        return row.to_pydict()['played_at'][0].hour


if __name__ == '__main__':
    splitter = ParquetSplitter(
        src_parquet_path="path/to/Parquet",
        target_dir="path/to/result/dir",
        num_chunks=100,
    )
    splitter.split()
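
Since the splitting granularity here is the row group, it can help to check how many row groups the source file actually contains before choosing num_chunks. A minimal sketch using the same pyarrow API (the path is a placeholder):

import pyarrow.parquet as pq

# The number of row groups is an upper bound on how many chunks you can get
pf = pq.ParquetFile("path/to/Parquet")
print(pf.num_row_groups, "row groups,", pf.metadata.num_rows, "rows")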

Also, you can use Pyspark or the Apache Beam Python SDK for this purpose. They let you split the file more efficiently because they can run on a multi-node cluster. The example above uses the low-level Pyarrow library and a single process on one machine, so the execution time can be long.
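
For reference, a minimal PySpark sketch of the same split, assuming a local Spark installation and the same placeholder paths:

from pyspark.sql import SparkSession

# Build a local Spark session, read the file, and rewrite it as 3 parquet files
spark = SparkSession.builder.appName("split-parquet").getOrCreate()
df = spark.read.parquet("path/to/Parquet")
df.repartition(3).write.mode("overwrite").parquet("path/to/result/dir")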
