
Databricks not saving dataframes as Parquet properly in the blob storage

I am using Databricks with a mounted blob storage. When I execute my Python notebook, which creates large pandas DataFrames and tries to store them as .parquet files, they show up having 0 bytes.

The saving takes place in a submodule that I import, not in the main notebook itself. The strange thing is that saving the DataFrame as a Parquet file always stores it as an empty file, i.e. with 0 bytes. However, if I try to save a DataFrame as a .parquet file in the main notebook itself, it works.

The problem seems to be very similar to this issue: https://community.databricks.com/s/question/0D58Y00009MIWkfSAH/how-can-i-save-a-parquet-file-using-pandas-with-a-data-factory-orchestrated-notebook

I have installed both pyarrow and pandas, and I try to save a DataFrame as follows:

df.to_parquet("blob storage location.parquet", index=False, engine="pyarrow")

Everything works fine locally, but running this in Databricks causes issues. I first tried to save my DataFrames as HDF5 files, but that saving process doesn't seem to work in Databricks either. I then switched to Parquet, but I am running into the issue described above.

Does anyone have a solution or an explanation as to why this is happening?

I tried to reproduce the same in my environment and got the results below.

This is my sample mount location path: /mnt/io243

dbutils.fs.mount(
    source = "wasbs://<container_Name>@<storage_account_name>.blob.core.windows.net/",
    mount_point = "/mnt/<mount_name>",
    extra_configs = {"fs.azure.account.key.<storage_account_name>.blob.core.windows.net":"Access_key"})
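
As a quick sanity check (a sketch, assuming the mount point /mnt/io243 from this example), you can list the existing mounts and confirm the new one shows up before writing to it:

# dbutils.fs.mounts() returns MountInfo entries with a mountPoint attribute
print(any(m.mountPoint == "/mnt/io243" for m in dbutils.fs.mounts()))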


NOTE: As you can see, my mount path is /mnt/io243. If I use that mount path without prepending /dbfs, the file is stored as an empty file. So use the mount path like this: /dbfs/mnt/io243/<file_name>.parquet. Also make sure fsspec is installed, using %pip install fsspec.
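
To make the path difference concrete, here is a minimal sketch (the mount name io243 is taken from this example; the small DataFrame pdf is hypothetical):

import pandas as pd

pdf = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Writing to the raw mount path is what the question describes, and it ends up
# as an empty (0-byte) file:
# pdf.to_parquet("/mnt/io243/sample.parquet", index=False, engine="pyarrow")

# Prepending /dbfs routes the write through the local FUSE mount, so pandas can
# write the file contents normally:
pdf.to_parquet("/dbfs/mnt/io243/sample.parquet", index=False, engine="pyarrow")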

I successfully got the file into the destination location using the below code.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

import pandas as pd
# %pip install fsspec

# sample data
my_data = [
    ("vamsi", "1", "M", 2000),
    ("saideep", "2", "M", 3000),
    ("rakesh", "3", "M", 4000)
]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])

df = spark.createDataFrame(data=my_data, schema=schema)

df1 = df.toPandas()

df1.to_parquet("/dbfs/mnt/io243/def1.parquet", index=False, engine="pyarrow")
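
As a quick check (assuming the same path as above), you can read the file back with pandas and confirm it is not empty:

check = pd.read_parquet("/dbfs/mnt/io243/def1.parquet")
print(check.shape)  # expect (3, 4) for the sample data above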


Yes, you can check whether the file is stored inside the mount location. Use this code:

dbutils.fs.ls('<mount_path>')
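
You can also confirm the file size from the FileInfo entries that dbutils.fs.ls returns (the mount path below is the one from this example):

for f in dbutils.fs.ls("/mnt/io243"):
    print(f.name, f.size)  # a non-zero size confirms the Parquet file was written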

