
Divide a parquet file into 3 parquet files using Python

Is there a way to divide a huge parquet file into smaller ones (using Python), keeping all the columns and just splitting the rows? Thank you.

You can do it with Dask:

import dask.dataframe as dd

# Read the source file and rewrite it as 3 partitions, one output file per partition
ddf = dd.read_parquet('my_file.parquet')
ddf.repartition(npartitions=3).to_parquet('my_files/')

Edit: you need to install either fastparquet or pyarrow.
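
If both libraries are installed and you want to be explicit about which one Dask uses, you can pass the engine argument. A minimal sketch, assuming pyarrow is the engine you want:

import dask.dataframe as dd

# Explicitly select the pyarrow engine for reading and writing
ddf = dd.read_parquet('my_file.parquet', engine='pyarrow')
ddf.repartition(npartitions=3).to_parquet('my_files/', engine='pyarrow')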

As the answer with Dask works only when the file fits into your computer's RAM, I'm going to share a script that uses Pyarrow and reads the file chunk by chunk (row group by row group):

import os
import pyarrow as pa
import pyarrow.parquet as pq


class ParquetSplitter:

    def __init__(self,
                 src_parquet_path: str,
                 target_dir: str,
                 num_chunks: int = 25
                 ):
        self._src_parquet_path = src_parquet_path
        self._target_dir = target_dir
        self._num_chunks = num_chunks

        self._src_parquet = pq.ParquetFile(
            self._src_parquet_path,
            memory_map=True,
        )

        self._total_group_num = self._src_parquet.num_row_groups
        self._schema = self._src_parquet.schema

    @property
    def num_row_groups(self):
        print(f'Total num of groups found: {self._total_group_num}')
        return self._src_parquet.num_row_groups

    @property
    def schema(self):
        return self._schema

    def read_rows(self):
        # Helper for inspecting the data; 'player_id' and 'played_at' are
        # column names from the original dataset -- replace them with your own.
        for elem in self._src_parquet.iter_batches(
                columns=['player_id', 'played_at']):
            elem: pa.RecordBatch
            print(elem.to_pydict())

    def split(self):
        for chunk_num, chunk_range in self._next_chunk_range():
            table = self._src_parquet.read_row_groups(row_groups=chunk_range)
            file_name = f'chunk_{chunk_num}.parquet'
            path = os.path.join(self._target_dir, file_name)
            print(f'Writing chunk #{chunk_num}...')
            pq.write_table(
                table=table,
                where=path,
            )

    def _next_chunk_range(self):
        upper_bound = self.num_row_groups

        # Guard against num_chunks > num_row_groups, which would make the
        # step zero and loop forever
        chunk_size = max(1, upper_bound // self._num_chunks)

        chunk_num = 0
        low, high = 0, chunk_size
        while low < upper_bound:
            group_range = list(range(low, high))

            yield chunk_num, group_range
            chunk_num += 1
            low, high = low + chunk_size, high + chunk_size
            if high > upper_bound:
                high = upper_bound

    @staticmethod
    def _get_row_hour(row: pa.RecordBatch):
        # Leftover helper from the original dataset; not used by split()
        return row.to_pydict()['played_at'][0].hour


if __name__ == '__main__':
    splitter = ParquetSplitter(
        src_parquet_path="path/to/Parquet",
        target_dir="path/to/result/dir",
        num_chunks=100,
    )
    splitter.split()
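
Since the splitting granularity here is the row group, it can help to check how many row groups the source file actually contains before choosing num_chunks. A minimal sketch using the same pyarrow API (the path is a placeholder):

import pyarrow.parquet as pq

# The number of row groups is an upper bound on how many chunks you can get
pf = pq.ParquetFile("path/to/Parquet")
print(pf.num_row_groups, "row groups,", pf.metadata.num_rows, "rows")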

Also, you can use Pyspark or the Apache Beam Python SDK for this purpose. They let you split the file more efficiently because they can run on a multi-node cluster. The example above uses the low-level Pyarrow library and a single process on one machine, so the execution time can be long.
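
For reference, a minimal PySpark sketch of the same split, assuming a local Spark installation and the same placeholder paths:

from pyspark.sql import SparkSession

# Build a local Spark session, read the file, and rewrite it as 3 parquet files
spark = SparkSession.builder.appName("split-parquet").getOrCreate()
df = spark.read.parquet("path/to/Parquet")
df.repartition(3).write.mode("overwrite").parquet("path/to/result/dir")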
