divide a parquet file into 3 parquet files using python
Is there a way to divide a huge parquet file into smaller ones (using Python), keeping all the columns and dividing the rows? Thank you
As the answer with Dask works only when the file fits in your computer's RAM, I'm going to share a script that uses PyArrow and reads the file row group by row group:
import os

import pyarrow as pa
import pyarrow.parquet as pq


class ParquetSplitter:
    def __init__(self,
                 src_parquet_path: str,
                 target_dir: str,
                 num_chunks: int = 25):
        self._src_parquet_path = src_parquet_path
        self._target_dir = target_dir
        self._num_chunks = num_chunks
        self._src_parquet = pq.ParquetFile(
            self._src_parquet_path,
            memory_map=True,
        )
        self._total_group_num = self._src_parquet.num_row_groups
        self._schema = self._src_parquet.schema

    @property
    def num_row_groups(self):
        print(f'Total num of groups found: {self._total_group_num}')
        return self._src_parquet.num_row_groups

    @property
    def schema(self):
        return self._schema

    def read_rows(self):
        # Debug helper: the column names below are specific to the
        # author's dataset -- replace them with your own.
        for elem in self._src_parquet.iter_batches(
                columns=['player_id', 'played_at']):
            elem: pa.RecordBatch
            print(elem.to_pydict())

    def split(self):
        for chunk_num, chunk_range in self._next_chunk_range():
            # Only the row groups of the current chunk are loaded into memory.
            table = self._src_parquet.read_row_groups(row_groups=chunk_range)
            file_name = f'chunk_{chunk_num}.parquet'
            path = os.path.join(self._target_dir, file_name)
            print(f'Writing chunk #{chunk_num}...')
            pq.write_table(
                table=table,
                where=path,
            )

    def _next_chunk_range(self):
        upper_bound = self.num_row_groups
        # Guard against num_chunks > num_row_groups, which would make
        # chunk_size zero and loop forever.
        chunk_size = max(1, upper_bound // self._num_chunks)
        chunk_num = 0
        low, high = 0, chunk_size
        while low < upper_bound:
            group_range = list(range(low, high))
            yield chunk_num, group_range
            chunk_num += 1
            low, high = low + chunk_size, high + chunk_size
            if high > upper_bound:
                high = upper_bound

    @staticmethod
    def _get_row_hour(row: pa.RecordBatch):
        # Helper specific to the author's dataset ('played_at' column).
        return row.to_pydict()['played_at'][0].hour


if __name__ == '__main__':
    splitter = ParquetSplitter(
        src_parquet_path="path/to/Parquet",
        target_dir="path/to/result/dir",
        num_chunks=100,
    )
    splitter.split()
Also, you can use Pyspark or the Apache Beam Python SDK for this purpose. They allow you to split the file more efficiently, as they can run on a multi-node cluster. The example above uses the low-level PyArrow library in a single process on one machine, so the execution time can be long.