Write/save a DataFrame to an Azure File Share from Azure Databricks

How can I write to an Azure File Share from Azure Databricks Spark jobs?

I configured the storage account key in the Hadoop configuration:

spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.STORAGEKEY.file.core.windows.net",
  "SECRETVALUE"
)


val wasbFileShare =
    s"wasbs://testfileshare@STORAGEKEY.file.core.windows.net/testPath"

df.coalesce(1).write.mode("overwrite").csv(wasbFileShare)

When I try to save the DataFrame to the Azure File Share, I get the following "resource not found" error, even though the URI exists:

 Exception in thread "main" org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: The requested URI does not represent any resource on the server.

Unfortunately, Azure Databricks does not support reading from or writing to Azure File Shares.

Azure Databricks supported data sources: https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/

I would suggest you provide feedback on this:

https://feedback.azure.com/forums/909463-azure-databricks

All of the feedback you share in these forums will be monitored and reviewed by the Microsoft engineering teams responsible for building Azure.

You may check out this SO thread, which addresses a similar issue: Databricks and Azure Files.

Below is a code snippet for writing CSV data directly to an Azure Blob Storage container from an Azure Databricks notebook.

# Configure blob storage account access key globally
spark.conf.set("fs.azure.account.key.chepra.blob.core.windows.net", "gv7nVIXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXdlOiA==")
output_container_path = "wasbs://sampledata@chepra.blob.core.windows.net"
output_blob_folder = "%s/wrangled_data_folder" % output_container_path

# write the dataframe as a single file to blob storage
(dataframe
 .coalesce(1)
 .write
 .mode("overwrite")
 .option("header", "true")
 .format("com.databricks.spark.csv")
 .save(output_blob_folder))

# Get the name of the wrangled-data CSV file that was just saved to Azure blob storage (it starts with 'part-')
files = dbutils.fs.ls(output_blob_folder)
output_file = [x for x in files if x.name.startswith("part-")]

# Move the wrangled-data CSV file from a sub-folder (wrangled_data_folder) to the root of the blob container
# While simultaneously changing the file name
dbutils.fs.mv(output_file[0].path, "%s/predict-transform-output.csv" % output_container_path)
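
To verify, you can list the container root and confirm that the renamed CSV is there (a quick optional check, reusing the same output_container_path as above):

# List the container root; predict-transform-output.csv should appear here
display(dbutils.fs.ls(output_container_path))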


Steps to connect to an Azure File Share from Databricks

First, install the Microsoft Azure Storage File Share client library for Python using pip in Databricks (see the cell below): https://pypi.org/project/azure-storage-file-share/
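
For example, in a notebook cell (a minimal sketch; %pip installs the package for the current notebook session):

# Install the Azure File Share client library from a Databricks notebook cell
%pip install azure-storage-file-share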

After installing it, create a storage account. Then you can create a file share from Databricks:

from azure.storage.fileshare import ShareClient

share = ShareClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<file share name that you want to create>")

share.create_share()

This code uploads a file to the file share from Databricks:

from azure.storage.fileshare import ShareFileClient
 
file_client = ShareFileClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<your_fileshare_name>", file_path="my_file")
 
with open("./SampleSource.txt", "rb") as source_file:
    file_client.upload_file(source_file)
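
Putting the two answers together for the original question, below is a minimal sketch (assuming a DataFrame named df, a Databricks notebook context for dbutils, and placeholder connection details) that writes the DataFrame as a single CSV and then pushes that file to the file share with ShareFileClient:

from azure.storage.fileshare import ShareFileClient

# Write the DataFrame as a single CSV to a DBFS staging folder (example path)
dbfs_dir = "dbfs:/tmp/wrangled_data"
(df.coalesce(1)
   .write
   .mode("overwrite")
   .option("header", "true")
   .csv(dbfs_dir))

# Find the part file Spark produced and copy it to the driver's local disk
part_file = [f for f in dbutils.fs.ls(dbfs_dir) if f.name.startswith("part-")][0]
dbutils.fs.cp(part_file.path, "file:/tmp/wrangled_data.csv")

# Upload the local copy to the file share (placeholders as in the snippets above)
file_client = ShareFileClient.from_connection_string(
    conn_str="<connection_string>",
    share_name="<your_fileshare_name>",
    file_path="wrangled_data.csv")

with open("/tmp/wrangled_data.csv", "rb") as source_file:
    file_client.upload_file(source_file)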

Refer to this link for further information: https://pypi.org/project/azure-storage-file-share/
