
Write to parquet row by row in Python

I obtain messages in an async cycle, and from each message I parse a row, which is a dictionary. I would like to write these rows to parquet. To implement this, I do the following:

import pyarrow as pa
import pyarrow.parquet as pq

fields = [('A', pa.float64()), ('B', pa.float64()), ('C', pa.float64()), ('D', pa.float64())]
schema = pa.schema(fields)
pqwriter = pq.ParquetWriter('sample.parquet', schema=schema, compression='gzip')

# async cycle starts here
async for message in messages:
    # from_pydict expects column values, so wrap each scalar in a list
    row = {'A': [message[1]], 'B': [message[2]], 'C': [message[3]], 'D': [message[4]]}
    table = pa.Table.from_pydict(row, schema=schema)
    pqwriter.write_table(table)
# end of async cycle
pqwriter.close()

Everything works, but the resulting parquet file is about ~5 MB, whereas writing the same data to a csv file gives a file of ~200 KB. I have checked that the data types are the same (the columns of both the csv and the parquet file are floats).

Why is my parquet file much larger than the csv with the same data?

Parquet is a columnar format that is optimized for writing batches of data. It is not meant to be used to write data row by row.

It is not well suited to your use case. You may want to write intermediate rows of data in a more suitable format (say avro or csv) and then convert the data in batches to parquet.

I have achieved the desired result as follows:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

chunksize = 1e6
data = []
fields = ...  # list of tuples, as above
schema = pa.schema(fields)

with pq.ParquetWriter('my_parquet', schema=schema) as writer:
    # async cycle starts here
    async for message in messages:
        row = ...  # dict with structure as in fields
        data.append(row)

        if len(data) > chunksize:
            df = pd.DataFrame(data)
            table = pa.Table.from_pandas(df, schema=schema)
            writer.write_table(table)
            data = []
    # end of async cycle
    if len(data) != 0:
        df = pd.DataFrame(data)
        table = pa.Table.from_pandas(df, schema=schema)
        writer.write_table(table)
# the with-block closes the writer, so no explicit writer.close() is needed

This code snippet does exactly what I need.

