
Why is adding the org.apache.spark spark-avro dependency mandatory to read/write Avro files in Spark 2.4 while I'm using com.databricks.spark.avro?

I tried to run my Spark 2.3.0 Scala code on a Cloud Dataproc 1.4 cluster, which has Spark 2.4.8 installed. I ran into an error when reading Avro files. Here's my code:

sparkSession.read.format("com.databricks.spark.avro").load(input)

This code failed as expected. Then I added this dependency to my pom.xml file:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>2.4.0</version>
</dependency>

This made my code run successfully. And this is the part I don't understand: I'm still using the com.databricks.spark.avro module in my code. Why did adding the org.apache.spark spark-avro dependency solve my problem, given that I'm not actually using it in my code?

I was expecting that I would need to change my code to something like this:

sparkSession.read.format("avro").load(input)

This is a historical artifact: Spark Avro support was initially added by Databricks in their proprietary Spark Runtime as the com.databricks.spark.avro format. When Spark Avro support was later added to open-source Spark as the avro format, support for the com.databricks.spark.avro format name was retained for backward compatibility, controlled by the spark.sql.legacy.replaceDatabricksSparkAvro.enabled property (when it is set to true):

If it is set to true, the data source provider com.databricks.spark.avro is mapped to the built-in but external Avro data source module for backward compatibility.
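For reference, here is a minimal sketch of both read paths (assuming Spark 2.4 with the spark-avro_2.11 artifact on the classpath; the input path is hypothetical). In Spark 2.4 this legacy property defaults to true, so it is set explicitly below only for clarity:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("avro-compat-sketch")
  // Defaults to true in Spark 2.4; set explicitly here for illustration.
  .config("spark.sql.legacy.replaceDatabricksSparkAvro.enabled", "true")
  .getOrCreate()

val input = "/tmp/example.avro" // hypothetical path

// Legacy Databricks format name, mapped to the built-in Avro module:
val dfLegacy = spark.read.format("com.databricks.spark.avro").load(input)

// Equivalent short name provided by org.apache.spark:spark-avro:
val dfAvro = spark.read.format("avro").load(input)

dfAvro.printSchema()

Both calls resolve to the same data source implementation, which is why the original code runs unchanged once the dependency is on the classpath.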
