

Read a Databricks table via Databricks api in Python?

Using Python 3, I am trying to compare an Excel (xlsx) sheet to an identical Spark table in Databricks. I want to avoid doing the comparison in Databricks itself, so I am looking for a way to read the Spark table via the Databricks API. Is this possible? How can I go about reading a table such as DB.TableName?

I can recommend that you write PySpark code in a notebook, call that notebook from a previously defined job, and establish a connection between your local machine and the Databricks workspace.

You could perform the comparison directly in Spark, or convert the data frames to pandas if you wish. If the notebook finishes the comparison, it can return the result from that particular job run (see the sketch below). Sending entire Databricks tables through the API is probably not feasible because of its limits: the Spark cluster is there to perform the complex operations, and the API should only be used to send small messages.
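As a minimal notebook-side sketch of that flow, assuming the workbook has been uploaded to DBFS (the path and table name below are placeholders, and pandas.read_excel needs an engine such as openpyxl installed on the cluster):

    import pandas as pd

    # Hypothetical DBFS path -- upload the workbook first and adjust accordingly.
    expected = spark.createDataFrame(pd.read_excel("/dbfs/FileStore/expected.xlsx"))
    actual = spark.table("DB.TableName")

    # Rows present on one side but not the other; zero on both sides means identical.
    diff_count = (expected.exceptAll(actual).count()
                  + actual.exceptAll(expected).count())

    # Keep the returned value small -- the API only surfaces the first 5 MB of output.
    dbutils.notebook.exit("MATCH" if diff_count == 0 else f"{diff_count} differing rows")

Here spark and dbutils are the globals Databricks provides inside a notebook, and exceptAll assumes both data frames share the same schema.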

Official documentation: https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/jobs#--runs-get-output

Retrieve the output and metadata of a run. When a notebook task returns a value through the dbutils.notebook.exit() call, you can use this endpoint to retrieve that value. Azure Databricks restricts this API to return the first 5 MB of the output. For returning a larger result, you can store job results in a cloud storage service.
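As a rough sketch of what retrieving that value from your local machine could look like, using the runs/get-output endpoint from the documentation above (the workspace URL, token, and run id are placeholders):

    import requests

    HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
    TOKEN = "dapi-your-personal-access-token"                    # placeholder PAT
    RUN_ID = 42                                                  # placeholder id of the finished run

    resp = requests.get(
        f"{HOST}/api/2.0/jobs/runs/get-output",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"run_id": RUN_ID},
    )
    resp.raise_for_status()

    # notebook_output.result holds the string passed to dbutils.notebook.exit()
    print(resp.json().get("notebook_output", {}).get("result"))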

There is no way to read the table from the DB API as far as I am aware, unless you run it as a job, as LaTreb already mentioned. However, if you really wanted to, you could use either the ODBC or JDBC drivers to get the data through your Databricks cluster.

Information on how to set this up can be found here.

Once you have the DSN set up, you can use pyodbc to connect to Databricks and run a query; a minimal sketch follows. At this time the ODBC driver will only allow you to run Spark SQL commands.
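A minimal sketch, assuming a DSN named Databricks was configured per the driver instructions (the DSN and table name are placeholders):

    import pyodbc

    # "Databricks" is whatever name you gave the DSN when setting up the ODBC driver.
    conn = pyodbc.connect("DSN=Databricks", autocommit=True)

    cursor = conn.cursor()
    cursor.execute("SELECT * FROM DB.TableName LIMIT 10")  # Spark SQL only
    for row in cursor.fetchall():
        print(row)

    conn.close()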

All that being said, it will probably still be easier to just load the data into Databricks, unless you have some sort of security concern.
