
Write/save DataFrame to Azure file share from Azure Databricks


How can I write to an Azure file share from an Azure Databricks Spark job?

I configured the Hadoop storage key and value.

spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.STORAGEKEY.file.core.windows.net",
  "SECRETVALUE"
)


val wasbFileShare =
    s"wasbs://testfileshare@STORAGEKEY.file.core.windows.net/testPath"

df.coalesce(1).write.mode("overwrite").csv(wasbFileShare)

When trying to save the DataFrame to the Azure file share, I see the following resource-not-found error even though the URI exists.

 Exception in thread "main" org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: The requested URI does not represent any resource on the server.

Unfortunately, Azure Databricks does not support reading from and writing to Azure file shares.

Data sources supported by Azure Databricks: https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/

I would suggest you provide feedback on the same:

https://feedback.azure.com/forums/909463-azure-databricks

All of the feedback you share in these forums is monitored and reviewed by the Microsoft engineering teams responsible for building Azure.

You may check out the SO thread that addressed a similar issue: Databricks and Azure Files

Here is a snippet for writing CSV data directly to an Azure blob storage container from an Azure Databricks notebook.

# Configure blob storage account access key globally
spark.conf.set("fs.azure.account.key.chepra.blob.core.windows.net", "gv7nVIXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXdlOiA==")
output_container_path = "wasbs://sampledata@chepra.blob.core.windows.net"
output_blob_folder = "%s/wrangled_data_folder" % output_container_path

# write the dataframe as a single file to blob storage
(dataframe
 .coalesce(1)
 .write
 .mode("overwrite")
 .option("header", "true")
 .format("com.databricks.spark.csv")
 .save(output_blob_folder))

# Get the name of the wrangled-data CSV file that was just saved to Azure blob storage (it starts with 'part-')
files = dbutils.fs.ls(output_blob_folder)
output_file = [x for x in files if x.name.startswith("part-")]

# Move the wrangled-data CSV file from a sub-folder (wrangled_data_folder) to the root of the blob container
# While simultaneously changing the file name
dbutils.fs.mv(output_file[0].path, "%s/predict-transform-output.csv" % output_container_path)


Steps to connect to an Azure file share from Databricks

First, install the Microsoft Azure Storage File Share client library for Python using pip install in Databricks. https://pypi.org/project/azure-storage-file-share/
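In a Databricks notebook, a minimal way to do this is the %pip magic (the package name comes from the PyPI page above):

%pip install azure-storage-file-share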

Once installed, create a storage account. You can then create a file share from Databricks:

from azure.storage.fileshare import ShareClient

share = ShareClient.from_connection_string(
    conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>",
    share_name="<file share name that you want to create>")

share.create_share()
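One caveat: create_share() fails if the share already exists. A minimal sketch that tolerates reruns, assuming the ResourceExistsError behavior of the current azure-storage-file-share SDK:

from azure.core.exceptions import ResourceExistsError

try:
    share.create_share()
except ResourceExistsError:
    # The share already exists; safe to continue and reuse it
    pass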

This code uploads a file to the file share from Databricks:

from azure.storage.fileshare import ShareFileClient
 
file_client = ShareFileClient.from_connection_string(
    conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>",
    share_name="<your_fileshare_name>",
    file_path="my_file")
 
with open("./SampleSource.txt", "rb") as source_file:
    file_client.upload_file(source_file)
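To tie this back to the original question, one possible approach (a sketch, not an official Databricks pattern) is to first write the DataFrame to a single CSV file on DBFS and then upload that file to the file share with ShareFileClient. The dbfs:/tmp/df_output folder, the df_output.csv target name, and access through the local /dbfs FUSE mount are assumptions of this sketch:

from azure.storage.fileshare import ShareFileClient

# Write the DataFrame as a single CSV file into a temporary DBFS folder
# (dbfs:/tmp/df_output is an arbitrary choice for this sketch)
(df.coalesce(1)
   .write
   .mode("overwrite")
   .option("header", "true")
   .csv("dbfs:/tmp/df_output"))

# Locate the part- file Spark produced in that folder
part_file = [f for f in dbutils.fs.ls("dbfs:/tmp/df_output")
             if f.name.startswith("part-")][0]

# Upload it to the file share through the local /dbfs mount
file_client = ShareFileClient.from_connection_string(
    conn_str="<connection_string>",
    share_name="<your_fileshare_name>",
    file_path="df_output.csv")

with open("/dbfs" + part_file.path.replace("dbfs:", "", 1), "rb") as source_file:
    file_client.upload_file(source_file)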

For more information, refer to this link: https://pypi.org/project/azure-storage-file-share/
