
Save Pandas or Pyspark dataframe from Databricks to Azure Blob Storage

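Is there a way to save a Pyspark or Pandas dataframe from Databricks to blob storage without mounting the storage or installing libraries?

I was able to do this after mounting the storage container in Databricks and using the library com.crealytics.spark.excel, but I would like to know whether the same can be done without the library and without the mount, because I will be working on clusters where I do not have these 2 permissions.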

Here is the code to save the dataframe locally to dbfs.

# export 
from os import path

folder = "export"
name = "export"
file_path_name_on_dbfs = path.join("/tmp", folder, name)

# Writing to DBFS
# .coalesce(1) is used to generate only 1 file; if the dataframe is too big this won't work, so you'll have multiple files and you need to copy them later one by one
sampleDF \
  .coalesce(1) \
      .write \
      .mode("overwrite") \
      .option("header", "true") \
      .option("delimiter", ";") \
      .option("encoding", "UTF-8") \
      .csv(file_path_name_on_dbfs)

# path of destination, which will be sent to az storage
dest = file_path_name_on_dbfs + ".csv"

# Renaming part-000...csv to our file name
target_file = list(filter(lambda file: file.name.startswith("part-00000"), dbutils.fs.ls(file_path_name_on_dbfs)))
if len(target_file) > 0:
  dbutils.fs.mv(target_file[0].path, dest)
  dbutils.fs.cp(dest, f"file://{dest}") # this line is added for community edition only cause /dbfs is not recognized, so we copy the file locally
  dbutils.fs.rm(file_path_name_on_dbfs,True)
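
The `.coalesce(1)` comment above notes that a large dataframe may leave several part files that have to be copied one by one. Below is a minimal sketch of how those files could be paired with destination blob names. The helper `plan_part_uploads` and the numbered naming scheme are hypothetical (not part of the original answer), and the input is assumed to be plain file names taken from a directory listing, for example `[f.name for f in dbutils.fs.ls(file_path_name_on_dbfs)]`.

```python
def plan_part_uploads(file_names, blob_prefix):
    """Pair each Spark part file with a numbered destination blob name.

    Marker files such as _SUCCESS are skipped because they do not
    start with "part-".
    """
    parts = sorted(
        name for name in file_names
        if name.startswith("part-") and name.endswith(".csv")
    )
    return [
        (name, f"{blob_prefix}/part-{index:05d}.csv")
        for index, name in enumerate(parts)
    ]


# Example listing with made-up file names
listing = ["_SUCCESS", "part-00001-abc.csv", "part-00000-abc.csv"]
plan = plan_part_uploads(listing, "export")
```

Each `(source, destination)` pair can then be uploaded with its own PUT request, using the destination as the blob path.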

Code to send the file to az storage:

import requests

sas="YOUR_SAS_TOKEN_PREVIOUSLY_CREATED" # follow the link below to create SAS token (using sas is slightly more secure than raw key storage) 
blob_account_name = "YOUR_BLOB_ACCOUNT_NAME"
container = "YOUR_CONTAINER_NAME"
destination_path_w_name = "export/export.csv"
url = f"https://{blob_account_name}.blob.core.windows.net/{container}/{destination_path_w_name}?{sas}"

# here we read the content of our previously exported df -> csv
# if you are not on community edition you might want to use "/dbfs" + dest
with open(dest, "rb") as f:  # binary mode, so the bytes are uploaded exactly as written
  payload = f.read()

headers = {
  'x-ms-blob-type': 'BlockBlob',
  'Content-Type': 'text/csv' # you can change the content type according to your needs
}

response = requests.request("PUT", url, headers=headers, data=payload)

# if response.status_code is 201 it means your file was created successfully
print(response.status_code)
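
If you would rather avoid third-party packages entirely, or you start from a Pandas dataframe and do not want to write to DBFS first, the same PUT can be sketched with the standard library only. This is an illustration under assumptions, not the original answer's method: `build_put_blob_request` is a hypothetical helper, the account, container and token values are placeholders, and the block only builds the request without sending it.

```python
from urllib.parse import quote
from urllib.request import Request


def build_put_blob_request(account, container, blob_path, sas, data,
                           content_type="text/csv"):
    """Build (but do not send) a PUT request that uploads `data` as a block blob."""
    # Accept a SAS token with or without a leading "?"
    token = sas.lstrip("?")
    url = (
        f"https://{account}.blob.core.windows.net/"
        f"{container}/{quote(blob_path)}?{token}"
    )
    headers = {
        "x-ms-blob-type": "BlockBlob",  # same header as in the requests version
        "Content-Type": content_type,
    }
    return Request(url, data=data, headers=headers, method="PUT")


# In-memory CSV bytes; placeholder values, nothing is sent here
csv_bytes = "id;name\n1;alice\n".encode("utf-8")
req = build_put_blob_request(
    "myaccount", "mycontainer", "export/export.csv", "?sv=x&sig=y", csv_bytes
)
```

Sending it would be `urllib.request.urlopen(req)`, which raises `urllib.error.HTTPError` on a non-2xx response. For a Pandas dataframe, `df.to_csv(index=False, sep=";").encode("utf-8")` should produce the bytes to pass as `data`.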

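Follow this link to set up a SAS token

Keep in mind that anyone who gets hold of the SAS token can access your storage, to the extent allowed by the permissions you set when creating the token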
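
Because the token travels in the URL's query string, printing or logging that URL leaks it. A small sketch of masking the signature before logging follows; `redact_sas` is a hypothetical helper, and it assumes the signature is carried in the `sig` query parameter, as in standard SAS tokens.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def redact_sas(url):
    """Return `url` with the value of the SAS `sig` parameter masked."""
    parts = urlsplit(url)
    query = [
        (key, "REDACTED" if key.lower() == "sig" else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder URL with a made-up signature
safe_url = redact_sas(
    "https://acct.blob.core.windows.net/c/export/export.csv"
    "?sv=2022-11-02&sp=cw&sig=abc%2Fdef%3D"
)
```

The other query parameters are kept (re-encoded), so the masked URL still shows which permissions and version were used.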
