Split a Parquet file into 3 Parquet files using Python

Question

Is there a way to split a huge Parquet file into smaller files using Python, keeping all the columns and dividing the rows? Thanks.

Answer 1

You can do this with dask.

import dask.dataframe as dd

ddf = dd.read_parquet('my_file.parquet')
# npartitions must be passed by keyword: the first positional argument is `divisions`.
ddf.repartition(npartitions=3).to_parquet('my_files/')

Edit: you need to have fastparquet or pyarrow installed.

Answer 2

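Since the Dask answer only works when the file fits in your computer's RAM, I'll share a script that uses Pyarrow and reads the file in pieces, one set of row groups at a time: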

import os
import pyarrow as pa
import pyarrow.parquet as pq


class ParquetSplitter:

    def __init__(self,
                 src_parquet_path: str,
                 target_dir: str,
                 num_chunks: int = 25
                 ):
        self._src_parquet_path = src_parquet_path
        self._target_dir = target_dir
        self._num_chunks = num_chunks

        self._src_parquet = pq.ParquetFile(
            self._src_parquet_path,
            memory_map=True,
        )

        self._total_group_num = self._src_parquet.num_row_groups
        self._schema = self._src_parquet.schema

    @property
    def num_row_groups(self):
        print(f'Total num of groups found: {self._total_group_num}')
        return self._src_parquet.num_row_groups

    @property
    def schema(self):
        return self._schema

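    def read_rows(self):
        # Debug helper: prints every batch of the selected columns.
        # The column names below come from the author's dataset; use your own.
        for elem in self._src_parquet.iter_batches(
                columns=['player_id', 'played_at']):
            elem: pa.RecordBatch
            print(elem.to_pydict())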

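    def split(self):
        # Create the output directory up front so that write_table does not fail.
        os.makedirs(self._target_dir, exist_ok=True)
        for chunk_num, chunk_range in self._next_chunk_range():
            # Only the row groups that belong to this chunk are loaded.
            table = self._src_parquet.read_row_groups(row_groups=chunk_range)
            file_name = f'chunk_{chunk_num}.parquet'
            path = os.path.join(self._target_dir, file_name)
            print(f'Writing chunk #{chunk_num}...')
            pq.write_table(
                table=table,
                where=path,
            )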

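    def _next_chunk_range(self):
        upper_bound = self.num_row_groups

        # Ceiling division: yields at most num_chunks chunks and never a chunk
        # size of 0, which would make the loop below run forever.
        chunk_size = (upper_bound + self._num_chunks - 1) // self._num_chunks

        chunk_num = 0
        low, high = 0, chunk_size
        while low < upper_bound:
            group_range = list(range(low, high))

            yield chunk_num, group_range
            chunk_num += 1
            # Advance the window and clamp it to the last row group.
            low, high = low + chunk_size, min(high + chunk_size, upper_bound)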

    @staticmethod
    def _get_row_hour(row: pa.RecordBatch):
        return row.to_pydict()['played_at'][0].hour


if __name__ == '__main__':
    splitter = ParquetSplitter(
        src_parquet_path="path/to/Parquet",
        target_dir="path/to/result/dir",
        num_chunks=100,
    )
    splitter.split()
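The row-group bookkeeping can be checked on its own, without touching any file. The sketch below is an alternative to `_next_chunk_range`, not the author's method: a hypothetical helper `plan_chunks` that spreads row-group indices as evenly as possible over at most `num_chunks` chunks.

```python
from typing import List


def plan_chunks(num_row_groups: int, num_chunks: int) -> List[List[int]]:
    """Assign row-group indices 0..num_row_groups-1 to at most num_chunks chunks."""
    if num_chunks < 1:
        raise ValueError('num_chunks must be at least 1')
    # There cannot be more non-empty chunks than row groups.
    num_chunks = min(num_chunks, num_row_groups)
    if num_chunks == 0:
        return []
    base, extra = divmod(num_row_groups, num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional row group each.
        size = base + (1 if i < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


print(plan_chunks(10, 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each inner list could be passed as `row_groups` to `read_row_groups`. Note that a file with a single row group still ends up in a single chunk, since row groups are never split here.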

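You can also use PySpark or the Apache Beam Python SDK for this. They can split the file more efficiently because they run on a multi-node cluster. The ParquetSplitter script above uses the low-level Pyarrow library in a single process on one machine, so it may take a long time to run.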


Answer 1
3 2018-07-11 11:08:54

Answer 2
1 2022-06-01 17:20:30
