Using pyarrow how do you append to parquet file?

How do you append/update to a parquet file with pyarrow?

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
table3 = pd.DataFrame({'six': [-1, np.nan, 2.5], 'nine': ['foo', 'bar', 'baz'], 'ten': [True, False, True]})


pq.write_table(pa.Table.from_pandas(table2), './dataNew/pqTest2.parquet')
# append pqTest2 here?

There is nothing I found in the docs about appending to parquet files. Also, can you use pyarrow with multiprocessing to insert/update the data?

I ran into the same issue and I think I was able to solve it using the following:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


chunksize = 10000  # number of rows per CSV chunk

pqwriter = None
for i, df in enumerate(pd.read_csv('sample.csv', chunksize=chunksize)):
    table = pa.Table.from_pandas(df)
    # for the first chunk of records
    if i == 0:
        # create a parquet write object giving it an output file
        pqwriter = pq.ParquetWriter('sample.parquet', table.schema)            
    pqwriter.write_table(table)

# close the parquet writer
if pqwriter:
    pqwriter.close()
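
Note that each write_table() call adds a (by default) separate row group to the same open file. As a quick sanity check, a minimal sketch assuming the sample.parquet produced by the loop above:

import pyarrow.parquet as pq

pf = pq.ParquetFile('sample.parquet')
print(pf.num_row_groups)     # roughly one row group per CSV chunk written
print(pf.metadata.num_rows)  # total rows across all chunks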

In your case the column names are not consistent. I made the column names consistent for the three sample dataframes, and the following code worked for me.

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def append_to_parquet_table(dataframe, filepath=None, writer=None):
    """Write/append a dataframe to a parquet file.

    This method writes a pandas DataFrame as a pyarrow Table in parquet format. If the method is invoked
    with a writer, it appends the dataframe to the already written pyarrow table.

    :param dataframe: pd.DataFrame to be written in parquet format.
    :param filepath: target file location for the parquet file.
    :param writer: ParquetWriter object to write pyarrow tables in parquet format.
    :return: ParquetWriter object. This can be passed to subsequent method calls to append DataFrames
        to the parquet file.
    """
    table = pa.Table.from_pandas(dataframe)
    if writer is None:
        writer = pq.ParquetWriter(filepath, table.schema)
    writer.write_table(table=table)
    return writer


if __name__ == '__main__':

    table1 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    table3 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    writer = None
    filepath = '/tmp/verify_pyarrow_append.parquet'
    table_list = [table1, table2, table3]

    for table in table_list:
        writer = append_to_parquet_table(table, filepath, writer)

    if writer:
        writer.close()

    df = pd.read_parquet(filepath)
    print(df)

Output:

   one  three  two
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz

Generally speaking, Parquet datasets consist of multiple files, so you append by writing an additional file into the same directory the data belongs to. It would be useful to have the ability to concatenate multiple files easily. I opened https://issues.apache.org/jira/browse/PARQUET-1154 to make this possible to do easily in C++ (and therefore Python).
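
A minimal sketch of that directory-based approach (the my_dataset directory and part file names here are illustrative, not from the original answer):

import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

os.makedirs('my_dataset', exist_ok=True)  # illustrative dataset directory

# "Append" by writing each new batch of rows as its own file in the directory.
df1 = pd.DataFrame({'one': [1, 2], 'two': ['a', 'b']})
df2 = pd.DataFrame({'one': [3, 4], 'two': ['c', 'd']})
pq.write_table(pa.Table.from_pandas(df1), 'my_dataset/part-0.parquet')
pq.write_table(pa.Table.from_pandas(df2), 'my_dataset/part-1.parquet')

# Readers treat the whole directory as one logical dataset.
print(pq.read_table('my_dataset').to_pandas())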

Demo of appending a Pandas dataframe to an existing .parquet file.

Tested on Python v3.9 on Windows and Linux.

Install PyArrow using pip:

pip install pyarrow==6.0.1

Or Anaconda / Miniconda:

conda install -c conda-forge pyarrow=6.0.1 -y

Demo code:

# Demo of appending to an existing .parquet file, two ways:
# simple (pandas round-trip) and faster (native PyArrow calls).

import os
import numpy as np
import pandas as pd
import pyarrow as pa  
import pyarrow.parquet as pq  

filepath = "parquet_append.parquet"

Method 1 of 2

Simple way: using pandas, read the original .parquet file in, append, and write the entire file back out.

# Create parquet file.
df = pd.DataFrame({"x": [1.,2.,np.nan], "y": ["a","b","c"]})  # Create dataframe ...
df.to_parquet(filepath)  # ... write to file.

# Append to original parquet file.
df = pd.read_parquet(filepath)  # Read original ...
df2 = pd.DataFrame({"x": [3.,4.,np.nan], "y": ["d","e","f"]})  # ... create new dataframe to append ...
df3 = pd.concat([df, df2])  # ... concatenate together ...
df3.to_parquet(filepath)  # ... overwrite original file.

# Demo that new data frame has been appended to old.
df_copy = pd.read_parquet(filepath)
print(df_copy)
#      x  y
# 0  1.0  a
# 1  2.0  b
# 2  NaN  c
# 0  3.0  d
# 1  4.0  e
# 2  NaN  f

Method 2 of 2

More complex but faster: using native PyArrow calls, memory-map the original file, append the new dataframe, and write the new file out.

# Write initial file using PyArrow.
df = pd.DataFrame({"x": [1.,2.,np.nan], "y": ["a","b","c"]})  # Create dataframe ...
table = pa.Table.from_pandas(df)
pq.write_table(table, where=filepath)

from pathlib import Path
from typing import Union

def parquet_append(filepath: Union[Path, str], df: pd.DataFrame) -> None:
    """
    Append a dataframe to an existing .parquet file. Reads the original .parquet file in, appends the new dataframe, writes a new .parquet file out.
    :param filepath: Filepath for parquet file.
    :param df: Pandas dataframe to append. Must have the same schema as the original.
    """
    table_to_append = pa.Table.from_pandas(df)
    table_original_file = pq.read_table(source=filepath, pre_buffer=False, use_threads=True, memory_map=True)  # Use memory map for speed.
    handle = pq.ParquetWriter(filepath, table_original_file.schema)  # Create new empty file. WARNING: production-level code should write to a temporary file, then rename it over the old one at the end.
    handle.write_table(table_original_file)
    handle.write_table(table_to_append)
    handle.close()  # Writes the binary footer. Until this occurs, the .parquet file is not usable.

# Append to original parquet file.
df = pd.DataFrame({"x": [3.,4.,np.nan], "y": ["d","e","f"]})  # ... create new dataframe to append ...
parquet_append(filepath, df)

# Demo that new data frame has been appended to old.
df_copy = pd.read_parquet(filepath)
print(df_copy)
#      x  y
# 0  1.0  a
# 1  2.0  b
# 2  NaN  c
# 0  3.0  d
# 1  4.0  e
# 2  NaN  f

Discussion

The answers from @Ibraheem Ibraheem and @yardstick17 have one caveat: the .parquet file cannot be read until .close() is called (it will throw an exception, as the binary footer is missing), and after .close() is called, the file cannot be appended to. In other words, they are not a complete solution for appending to an existing .parquet file.
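
A small sketch illustrating that caveat (the open_file.parquet name is illustrative):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'x': [1, 2]})
table = pa.Table.from_pandas(df)
writer = pq.ParquetWriter('open_file.parquet', table.schema)
writer.write_table(table)
try:
    pd.read_parquet('open_file.parquet')  # footer not yet written
except Exception as e:
    print('read before close() failed:', e)
writer.close()
print(pd.read_parquet('open_file.parquet'))  # now readable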

It would be possible to modify this to merge multiple .parquet files in a folder into a single .parquet file.
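
A hedged sketch of that merge idea, assuming every file in the (illustrative) my_folder directory shares the same schema:

import glob
import pyarrow.parquet as pq

source_files = sorted(glob.glob('my_folder/*.parquet'))  # illustrative folder
schema = pq.ParquetFile(source_files[0]).schema_arrow    # assumes identical schemas

with pq.ParquetWriter('merged.parquet', schema) as writer:
    for path in source_files:
        writer.write_table(pq.read_table(path))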

After extensive research, I believe that it is not possible to append to an existing .parquet file with the existing libraries.
