
Can't write to Azure Sql DataWarehouse from databricks pyspark workers

I am trying to simply write data to Azure SQL Data Warehouse, using Azure Blob Storage for staging.

There is a very straightforward tutorial in the Azure Databricks documentation, azure/sql-data-warehouse, which works if you follow it step by step.

However, in my scenario I have to do the writing from a worker that is executing a foreach.

Here are some links related to the issue:

error-using-pyspark-with-wasb-connecting-pyspark-with-azure-blob

github.com/Azure/mmlspark/issues/456

pyspark-java-io-ioexception-no-filesystem-for-scheme-https

So, the code below WORKS:

  from pyspark.sql.session import SparkSession

  spark = SparkSession.builder.getOrCreate()
  spark.conf.set("fs.azure.account.key.<storageAccountName>.blob.core.windows.net", "myKey")
  df = spark.createDataFrame([(1, 2, 3, 4), (5, 6, 7, 8)], ('a', 'b', 'c', 'd'))

  (df.write
   .format("com.databricks.spark.sqldw")
   .option("url", "jdbc:sqlserver:...")
   .option("user", "user@server")
   .option("password", "pass")
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("dbTable", "dbo.table_teste")
   .option("tempDir", "wasbs://<container>@<storageAccountName>.blob.core.windows.net/")
   .mode("append")
   .save())

However, it fails when I insert the code above inside a foreach, like below:

from pyspark.sql.session import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder.getOrCreate()          

def iterate(row):
    # The code above

dfIter = spark.createDataFrame([(1, 2, 3, 4)], ('a', 'b', 'c', 'd'))
dfIter.rdd.foreach(iterate)

Executing it will generate this exception:

py4j.protocol.Py4JJavaError: An error occurred while calling o54.save.
: com.databricks.spark.sqldw.SqlDWConnectorException: Exception encountered in SQL DW connector code.

Caused by: java.io.IOException: No FileSystem for scheme: wasbs

I have had the same kind of issue when saving to Delta tables: pyspark-saving-is-not-working-when-called-from-inside-a-foreach

But in that case, I just needed to set '/dbfs/' at the beginning of the Delta table location, so the worker would be able to save it in the right place.

Based on that, I believe something is missing on the worker, and that is why it is not properly executing this save. Maybe a library that I should set up in the Spark config.

I also looked into the Databricks community thread save-the-results-of-a-query-to-azure-blo, where they managed to solve the issue by setting this config:

sc.hadoopConfiguration.set("fs.wasbs.impl","org.apache.hadoop.fs.azure.NativeAzureFileSystem")

PySpark:

spark.sparkContext._jsc.hadoopConfiguration().set("fs.wasbs.impl","org.apache.hadoop.fs.azure.NativeAzureFileSystem")

But it didn't work and I got this exception:

Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found

org.apache.hadoop:hadoop-azure:3.2.0 is installed.

Well, any help?

I believe your main issue is that you are trying to write from within a foreach loop, which basically renders any kind of batching/scaling moot, and that is what the SQL DW connector is designed for. If you really need to write out from within the loop and the data volume is not too huge, you might be able to achieve this using the simple JDBC connector: https://docs.databricks.com/spark/latest/data-sources/sql-databases.html
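If it helps, here is a minimal sketch of that approach with the generic Spark JDBC data source from the linked docs, issued from the driver; the server, database and credential values below are placeholders, not something taken from your setup:

from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3, 4)], ('a', 'b', 'c', 'd'))

# The generic JDBC data source writes straight over JDBC and does not stage
# through a wasbs:// tempDir, so no extra Hadoop filesystem config is involved.
(df.write
 .format("jdbc")
 .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>")
 .option("dbtable", "dbo.table_teste")
 .option("user", "user@server")
 .option("password", "pass")
 .mode("append")
 .save())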

But still note that SQL DW is really optimized for large-scale writes, not for single-row ingestion.
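To make that bulk pattern concrete, here is a hedged sketch that reuses the connector options from your working snippet: the per-row logic is expressed as a DataFrame transformation (the 'total' column is just an illustrative placeholder), and a single SQL DW write is issued from the driver:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.conf.set("fs.azure.account.key.<storageAccountName>.blob.core.windows.net", "myKey")

df = spark.createDataFrame([(1, 2, 3, 4), (5, 6, 7, 8)], ('a', 'b', 'c', 'd'))

# Per-row work runs distributed on the workers as a transformation...
transformed = df.withColumn("total", F.col("a") + F.col("b") + F.col("c") + F.col("d"))

# ...and one bulk write with the SQL DW connector is triggered from the driver,
# staging through the wasbs:// tempDir exactly as in the working snippet.
(transformed.write
 .format("com.databricks.spark.sqldw")
 .option("url", "jdbc:sqlserver:...")
 .option("user", "user@server")
 .option("password", "pass")
 .option("forwardSparkAzureStorageCredentials", "true")
 .option("dbTable", "dbo.table_teste")
 .option("tempDir", "wasbs://<container>@<storageAccountName>.blob.core.windows.net/")
 .mode("append")
 .save())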
