
Can we load the data from a pandas dataframe to a Databricks table without spark.sql?

I have a requirement to write data from a csv/pandas dataframe to a Databricks table. My Python code may not be running on a Databricks cluster; it may be running on an isolated standalone node. I am using the Databricks Python connector to select data from a Databricks table, and the selects are working. But I am unable to load the data from a csv or pandas dataframe into Databricks.

Can I use the Databricks Python connector to bulk-load the data in a csv/pandas dataframe into a Databricks table?

Below is the code snippet for getting the Databricks connection and performing selects on the standalone node using the databricks-python connector.

from databricks import sql

# Open a connection to the Databricks SQL endpoint
# (server_name, http_path, access_token come from the enclosing class).
conn = sql.connect(server_hostname=self.server_name,
                   http_path=self.http_path,
                   access_token=self.access_token)
try:
    with conn.cursor() as cursor:
        cursor.execute(qry)
        # Fetch the result set as Arrow and convert it to a pandas DataFrame
        return cursor.fetchall_arrow().to_pandas()
except Exception as e:
    print("Exception Occurred: " + str(e))

Note: My csv file is on Azure ADLS Gen2 storage. I am reading this file to create a pandas dataframe (a rough sketch of that read is shown below). All I need is either to load the data from pandas into a Databricks Delta table, or to read the csv file and load the data into a Delta table. Can this be achieved using the databricks-python connector instead of using Spark?
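For reference, this is roughly how I read the file into pandas. A minimal sketch, assuming the adlfs package is installed so pandas can resolve abfs:// paths via fsspec; the account, container, path, and key below are placeholders:

import pandas as pd

# Read the CSV on ADLS Gen2 straight into a pandas DataFrame.
# Assumes the adlfs package is installed; the account, container,
# path, and key below are placeholders.
pdf = pd.read_csv(
    "abfs://mycontainer@myaccount.dfs.core.windows.net/data.csv",
    storage_options={"account_key": "<storage-account-key>"},
)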

Can this be achieved using the databricks-python connector instead of using Spark?

The Databricks SQL Connector for Python is a Python library that allows you to use Python code to run SQL commands on Databricks clusters and Databricks SQL warehouses.

So, there isn't any scope within the Databricks SQL Connector for Python to convert a pandas DataFrame to a Delta table.

Coming to the second part of your question: is there any other way to convert a pandas DataFrame to a Delta table without using spark.sql?

Since Delta Lake is tied to Spark, there isn't, as far as I know, any way to convert a pandas DataFrame to a Delta table without using Spark.

Alternatively, I suggest you read the file as a Spark DataFrame and then convert it into Delta format using the code below.

// Path to the CSV file (here, a DBFS mount point)
val file_location = "/mnt/tables/data.csv"

// Read the CSV into a Spark DataFrame, inferring the schema from the data
val df = spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("sep", ",")
  .load(file_location)

// Placeholder: replace with your target Delta table name
val table_name = "my_table"

// Write the DataFrame out as a managed Delta table
df.write.mode("overwrite").format("delta").saveAsTable(table_name)
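Once the table is created, you can check it from your standalone node with the same connector pattern shown in the question. A minimal sketch, assuming server_name, http_path, and access_token hold the same values as in your snippet, and the table name matches the one used in saveAsTable above:

from databricks import sql

# Query the newly created Delta table from the standalone node.
conn = sql.connect(server_hostname=server_name,
                   http_path=http_path,
                   access_token=access_token)
with conn.cursor() as cursor:
    cursor.execute("SELECT * FROM my_table LIMIT 10")
    print(cursor.fetchall_arrow().to_pandas())
conn.close()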
