
Accessing Azure DevOps Git file directly from Azure Databricks

We have a CSV file stored in an ADO (Azure DevOps) Git repository. I have an Azure Databricks cluster running, and in the workspace I have Python code that reads this CSV file and transforms it into a Spark dataframe. But every time the file changes, I have to manually download it from ADO Git and upload it to the Databricks workspace. I use the following command to verify that the file has been uploaded:

dbutils.fs.ls("/FileStore/tables")

It lists my file. I then use the following Python code to convert this CSV to a Spark dataframe:

file_location = "/FileStore/tables/MyFile.csv"
file_type = "csv"
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

So there is a manual step involved every time the file in the ADO Git repository changes. Is there any Python function I can use to point directly to the copy of the file in the master branch of the ADO Git repository?

You have 2 choices, depending on what would be simpler for you:

  1. Use the Azure DevOps Python API to access the file (called an item in the API) inside the Git tree. Because this file will be accessible only from the driver node, you will then need to use dbutils.fs.cp to copy the file from the driver node into /FileStore/tables (see the sketch after this list).
  2. Set up a build pipeline in your Git repository that is triggered only on commits touching the specific file, and when it changes, use the Databricks CLI (the databricks fs cp ... command) to copy the file directly into DBFS (a pipeline sketch follows below). Here is an example that does not do exactly what you want, but it can be used as inspiration.
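
For option 1, here is a minimal sketch assuming the azure-devops Python package (pip install azure-devops) and a personal access token with Code (Read) scope. The organization URL, project, repository, file path and secret scope names are placeholders, and the GitClient method used (get_item_text) should be verified against the SDK version you install:

from azure.devops.connection import Connection
from msrest.authentication import BasicAuthentication

# Placeholders - replace with your own organization, project, repo and path.
organization_url = "https://dev.azure.com/<your-org>"
project = "<your-project>"
repository = "<your-repo>"
repo_file_path = "/MyFile.csv"

# Authenticate with a personal access token (here read from a Databricks
# secret scope; the scope/key names are assumptions).
pat = dbutils.secrets.get(scope="ado", key="pat")
connection = Connection(base_url=organization_url,
                        creds=BasicAuthentication("", pat))
git_client = connection.clients.get_git_client()

# get_item_text streams the file content in chunks; by default the item is
# read from the repository's default branch (e.g. master).
chunks = git_client.get_item_text(repository_id=repository,
                                  path=repo_file_path,
                                  project=project)
csv_text = "".join(chunks)

# The content only exists on the driver node, so write it to the driver's
# local disk and copy it into DBFS where /FileStore/tables lives.
local_path = "/tmp/MyFile.csv"
with open(local_path, "w") as f:
    f.write(csv_text)

dbutils.fs.cp("file:" + local_path, "/FileStore/tables/MyFile.csv")

After this copy, the spark.read code shown above works unchanged against /FileStore/tables/MyFile.csv.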
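For option 2, here is a hedged sketch of an Azure Pipelines YAML definition that runs only when the CSV changes on master and pushes it to DBFS with the pip-installable Databricks CLI. The repository path of the file (data/MyFile.csv) and the pipeline variable names holding the workspace URL and token are assumptions:

# azure-pipelines.yml - runs only when data/MyFile.csv changes on master
trigger:
  branches:
    include:
      - master
  paths:
    include:
      - data/MyFile.csv

pool:
  vmImage: 'ubuntu-latest'

steps:
  - script: |
      pip install databricks-cli
      databricks fs cp --overwrite data/MyFile.csv dbfs:/FileStore/tables/MyFile.csv
    displayName: 'Copy changed CSV into DBFS'
    env:
      DATABRICKS_HOST: $(DATABRICKS_HOST)    # workspace URL, stored as a pipeline variable
      DATABRICKS_TOKEN: $(DATABRICKS_TOKEN)  # Databricks PAT, stored as a secret variable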
