
Save Pandas or PySpark dataframe from Databricks to Azure Blob Storage

Is there a way I can save a PySpark or Pandas dataframe from Databricks to Azure Blob Storage without mounting the storage or installing libraries?

I was able to achieve this after mounting the storage container into Databricks and using the library com.crealytics.spark.excel, but I was wondering if I can do the same without the library and without mounting, because I will be working on clusters that don't have these two permissions.

Here is the code for saving the dataframe to DBFS.

# export
from os import path

folder = "export"
name = "export"
file_path_name_on_dbfs = path.join("/tmp", folder, name)

# Writing to DBFS.
# .coalesce(1) is used to generate only 1 file; if the dataframe is too big this
# won't work, so you'll get multiple part files and need to copy them one by one
# (see the sketch below).
sampleDF \
    .coalesce(1) \
    .write \
    .mode("overwrite") \
    .option("header", "true") \
    .option("delimiter", ";") \
    .option("encoding", "UTF-8") \
    .csv(file_path_name_on_dbfs)

# Path of the destination file, which will be sent to Azure Storage
dest = file_path_name_on_dbfs + ".csv"

# Renaming part-00000...csv to our file name
target_file = list(filter(lambda file: file.name.startswith("part-00000"), dbutils.fs.ls(file_path_name_on_dbfs)))
if len(target_file) > 0:
  dbutils.fs.mv(target_file[0].path, dest)
  dbutils.fs.cp(dest, f"file://{dest}")  # community edition only: /dbfs is not mounted there, so we copy the file to the local filesystem
  dbutils.fs.rm(file_path_name_on_dbfs, True)
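
If the dataframe is too big to coalesce into a single file, a variation of the renaming step can move every part file out instead. A minimal sketch, assuming the same folder layout as above; each resulting CSV would then be uploaded individually with the PUT request shown further down:

# Sketch: drop .coalesce(1) above, then move every part file out of the folder
part_files = [f for f in dbutils.fs.ls(file_path_name_on_dbfs) if f.name.startswith("part-")]
for i, f in enumerate(part_files):
  dbutils.fs.mv(f.path, f"{file_path_name_on_dbfs}_{i}.csv")
dbutils.fs.rm(file_path_name_on_dbfs, True)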

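Since the question also mentions Pandas: if the dataframe fits in driver memory, converting it to Pandas writes a single CSV directly and skips the part-file renaming entirely. A minimal sketch, assuming sampleDF is small enough to collect:

# Sketch: convert to Pandas and write one CSV directly (no part files to rename)
pdf = sampleDF.toPandas()
pdf.to_csv("/dbfs" + dest, sep=";", index=False, encoding="UTF-8")
# on community edition, use a plain local path instead, since /dbfs is not mounted
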
The code that uploads the file to Azure Blob Storage:

import requests

sas = "YOUR_SAS_TOKEN_PREVIOUSLY_CREATED"  # follow the link below to create a SAS token (a SAS token is slightly more secure than the raw storage account key)
blob_account_name = "YOUR_BLOB_ACCOUNT_NAME"
container = "YOUR_CONTAINER_NAME"
destination_path_w_name = "export/export.csv"
url = f"https://{blob_account_name}.blob.core.windows.net/{container}/{destination_path_w_name}?{sas}"

# Read the content of the previously exported CSV.
# If you are not on community edition, read from "/dbfs" + dest instead.
payload = open(dest).read()

headers = {
  'x-ms-blob-type': 'BlockBlob',
  'Content-Type': 'text/csv'  # change the content type according to your needs
}

response = requests.request("PUT", url, headers=headers, data=payload)

# A status code of 201 means the blob was created successfully
print(response.status_code)
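
If you prefer to fail fast rather than just printing the status code, requests can raise on any HTTP error:

# optional: raise requests.exceptions.HTTPError for 4xx/5xx responses
response.raise_for_status()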

Follow this link to set up a SAS token.
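
For reference, a SAS token can also be generated programmatically with the azure-storage-blob package; a minimal sketch with placeholder values (note that this requires installing the package, so it may not fit the no-library constraint of the question):

from datetime import datetime, timedelta
from azure.storage.blob import generate_container_sas, ContainerSasPermissions

# sketch only: requires the azure-storage-blob package; the account key is a placeholder
sas = generate_container_sas(
    account_name=blob_account_name,
    container_name=container,
    account_key="YOUR_STORAGE_ACCOUNT_KEY",
    permission=ContainerSasPermissions(write=True, create=True),
    expiry=datetime.utcnow() + timedelta(hours=1),
)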

Remember that anyone who has the SAS token can access your storage, depending on the permissions you set when creating it.
