
Accessing Azure DevOps Git file directly from Azure Databricks

We have a CSV file stored in an ADO (Azure DevOps) Git repository. I have an Azure Databricks cluster running, and in the workspace I have Python code that reads this CSV file and transforms it into a Spark dataframe. But every time the file changes, I have to manually download it from ADO Git and upload it to the Databricks workspace. I use the following command to verify that the file has been uploaded:

dbutils.fs.ls("/FileStore/tables")

It lists my file. I then use the following Python code to convert this CSV to a Spark dataframe:

file_location = "/FileStore/tables/MyFile.csv"
file_type = "csv"
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

So there is a manual step involved every time the file in the ADO Git repository changes. Is there any Python function I can use to point directly to the copy of the file in the master branch of the ADO Git repository?

You have 2 choices, depending on what would be simpler for you:

  1. Use the Azure DevOps Python API to access the file (called an item in the API) inside the Git tree. Because this file will be accessible only from the driver node, you will then need to use dbutils.fs.cp to copy the file from the driver node into /FileStore/tables (see the sketch after this list).
  2. Set up a build pipeline in your Git repository that is triggered only on commits touching the specific file, and when it changes, use the Databricks CLI (the databricks fs cp ... command) to copy the file directly into DBFS (a pipeline sketch follows below). Here is an example that does not do exactly what you want, but it can be used as inspiration.
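
For option 1, here is a minimal sketch assuming the azure-devops Python package (pip install azure-devops) and a personal access token with Code (Read) scope. The organization URL, project, repository, file path and secret scope names are placeholders, and the GitClient method used (get_item_text) should be verified against the SDK version you install:

from azure.devops.connection import Connection
from msrest.authentication import BasicAuthentication

# Placeholders - replace with your own organization, project, repo and path.
organization_url = "https://dev.azure.com/<your-org>"
project = "<your-project>"
repository = "<your-repo>"
repo_file_path = "/MyFile.csv"

# Authenticate with a personal access token (here read from a Databricks
# secret scope; the scope/key names are assumptions).
pat = dbutils.secrets.get(scope="ado", key="pat")
connection = Connection(base_url=organization_url,
                        creds=BasicAuthentication("", pat))
git_client = connection.clients.get_git_client()

# get_item_text streams the file content in chunks; by default the item is
# read from the repository's default branch (e.g. master).
chunks = git_client.get_item_text(repository_id=repository,
                                  path=repo_file_path,
                                  project=project)
csv_text = "".join(chunks)

# The content only exists on the driver node, so write it to the driver's
# local disk and copy it into DBFS where /FileStore/tables lives.
local_path = "/tmp/MyFile.csv"
with open(local_path, "w") as f:
    f.write(csv_text)

dbutils.fs.cp("file:" + local_path, "/FileStore/tables/MyFile.csv")

After this copy, the spark.read code shown above works unchanged against /FileStore/tables/MyFile.csv.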
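For option 2, here is a hedged sketch of an Azure Pipelines YAML definition that runs only when the CSV changes on master and pushes it to DBFS with the pip-installable Databricks CLI. The repository path of the file (data/MyFile.csv) and the pipeline variable names holding the workspace URL and token are assumptions:

# azure-pipelines.yml - runs only when data/MyFile.csv changes on master
trigger:
  branches:
    include:
      - master
  paths:
    include:
      - data/MyFile.csv

pool:
  vmImage: 'ubuntu-latest'

steps:
  - script: |
      pip install databricks-cli
      databricks fs cp --overwrite data/MyFile.csv dbfs:/FileStore/tables/MyFile.csv
    displayName: 'Copy changed CSV into DBFS'
    env:
      DATABRICKS_HOST: $(DATABRICKS_HOST)    # workspace URL, stored as a pipeline variable
      DATABRICKS_TOKEN: $(DATABRICKS_TOKEN)  # Databricks PAT, stored as a secret variable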
