Convert DataFrame to Parquet in Azure Function App using Python
I am downloading 2 CSV files from my Azure Data Lake Storage (Gen2), merging them together, and uploading the result in Parquet format to the same storage account, but to a different folder. I want to upload the summary DataFrame in Parquet format to my storage account using a Function App in VS Code. The code runs perfectly locally, but the Function App gives me a '500 - internal server error'. There seems to be an issue with the pyarrow engine that I use for the to_parquet method; Azure does not appear to support this engine.
import pandas as pd
from azure.storage.filedatalake import DataLakeServiceClient
import azure.functions as func
from io import StringIO

def main(req: func.HttpRequest) -> func.HttpResponse:
    STORAGEACCOUNTURL = 'https://storage_acc_name.dfs.core.windows.net/'
    STORAGEACCOUNTKEY = 'Key'
    # names must include the .csv extension, or the checks in the loop below never match
    LOCALFILENAME = ['file1.csv', 'file2.csv']
    file1 = pd.DataFrame()
    file2 = pd.DataFrame()
    service_client = DataLakeServiceClient(account_url=STORAGEACCOUNTURL, credential=STORAGEACCOUNTKEY)
    adl_client_instance = service_client.get_file_system_client(file_system="raw")
    for i in LOCALFILENAME:
        if i == 'file1.csv':
            file_client = adl_client_instance.get_file_client(i)
            adl_data = file_client.download_file()
            s = str(adl_data.readall(), 'utf-8')
            file1 = pd.read_csv(StringIO(s))
        if i == 'file2.csv':
            file_client = adl_client_instance.get_file_client(i)
            adl_data = file_client.download_file()
            s = str(adl_data.readall(), 'utf-8')
            file2 = pd.read_csv(StringIO(s))
    summary = pd.merge(left=file1, right=file2, on='key', how='inner')
    file_system_client = service_client.get_file_system_client(file_system="output")
    directory_client = file_system_client.get_directory_client("output")
    file_client = directory_client.create_file("output.parquet")
    file_contents = summary.to_parquet()
    file_client.append_data(data=file_contents, offset=0, length=len(file_contents))
    file_client.flush_data(len(file_contents))
    return func.HttpResponse("This HTTP triggered function executed successfully.")
Maybe you can use PySpark:

df_MF = spark_session.createDataFrame(df)
# now you have a Spark df; you can save it using Spark's own writer