
JDBC databricks to databricks connection

I am looking to connect to a Delta Lake in one Databricks instance from a different Databricks instance. I have downloaded the SparkSimba JDBC driver jar from the downloads page. When I use the following code:

result = spark.read.format("jdbc").option('user', 'token').option('password', <password>).option('query', query).option("url", <url>).option('driver','com.simba.spark.jdbc42.Driver').load()

I get the following error:

Py4JJavaError: An error occurred while calling o287.load.: java.lang.ClassNotFoundException: com.simba.spark.jdbc42.Driver

From reading around, it seems I need to register the driver class path, but I can't find a way to make this work.

I have tried the following code, but the bin/pyspark directory does not exist in my Databricks environment:

%sh bin/pyspark --driver-class-path $/dbfs/driver/simbaspark/simbaspark.jar --jars /dbfs/driver/simbaspark/simbaspark.jar

I have also tried:

java -jar /dbfs/driver/simbaspark/simbaspark.jar

but I get this error back: no main manifest attribute, in dbfs/driver/simbaspark/simbaspark

If you want to do that (it's really not recommended), then you just need to upload this library to DBFS and attach it to the cluster via the UI or an init script. After that it will be available to both the driver and the executors.
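As an illustration (not part of the original answer), a minimal cluster-scoped init script could simply copy the jar onto the cluster classpath; the paths below assume the jar sits at /dbfs/driver/simbaspark/simbaspark.jar as in the question, and the script location is an arbitrary choice:

# One-time setup cell: write an init script to DBFS that copies the Simba
# JDBC jar into /databricks/jars so it is on the driver and executor classpath.
dbutils.fs.put(
    "dbfs:/databricks/init-scripts/install-simba-jdbc.sh",
    """#!/bin/bash
cp /dbfs/driver/simbaspark/simbaspark.jar /databricks/jars/
""",
    True,  # overwrite if the script already exists
)

The script then has to be registered as a cluster-scoped init script in the cluster configuration, and the cluster restarted, before the com.simba.spark.jdbc42.Driver class becomes visible to spark.read.format("jdbc").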

But really, as I understand it, your data is stored on DBFS in the default location (the so-called DBFS root). Storing data in the DBFS root isn't recommended, and this is pointed out in the documentation:

Data written to mount point paths (/mnt) is stored outside of the DBFS root. Even though the DBFS root is writeable, Databricks recommends that you store data in mounted object storage rather than in the DBFS root. The DBFS root is not intended for production customer data.

So you need to create a separate storage account, or a container in an existing storage account, and mount it to the Databricks workspace. This can be done in multiple workspaces, which solves the problem of sharing data between them. It's a standard recommendation for Databricks deployments in any cloud.
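A rough sketch of such a mount on Azure is shown below; the storage account, container, secret scope and service principal values are placeholders I'm assuming for illustration, not something from the original answer:

# Mount an ADLS Gen2 container into the workspace; creating the same mount
# in every workspace gives them all access to the shared Delta tables.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("<scope>", "<service-principal-secret>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/shared-delta",
    extra_configs=configs,
)

# Once mounted, the second workspace can read the Delta table directly
# instead of going through JDBC.
df = spark.read.format("delta").load("/mnt/shared-delta/<table-path>")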

Here's an example code block that I use (hope it helps):

hostURL = "jdbc:mysql://xxxx.mysql.database.azure.com:3306/acme_db?useSSL=true&requireSSL=false"
databaseName = "acme_db"
tableName = "01_dim_customers"
userName = "xxxadmin@xxxmysql"
password = "xxxxxx"


# Read the table over JDBC into a Spark DataFrame
df = (
    spark.read
        .format("jdbc")
        .option("url", hostURL)
        .option("databaseName", databaseName)
        .option("dbTable", tableName)
        .option("user", userName)
        .option("password", password)
        .option("ssl", True)
        .load()
)

display(df)
